NASA-CR-205139 





MPI-IO: A Parallel File I/O Interface for MPI 

Version 0.3 


[NAS Technical Report NAS-95-002 January 1995] 


mpi-io@nas.nasa.gov 


Peter Corbett, Dror Feitelson, Yarsun Hsu, Jean-Pierre Prost, Marc Snir 

IBM T.J. Watson Research Center 
P.O. Box 218

Yorktown Heights, NY 10598 

& 

Sam Fineberg (1), Bill Nitzberg (1), Bernard Traversat (1), Parkson Wong (1)

NAS Systems Division
NASA Ames Research Center
Mail Stop 258-6
Moffett Field, CA 94035-1000


Typeset on January 30, 1995


(1) Computer Sciences Corporation, NASA Ames Research Center, under Contract NAS2-12961




Contents


1 Introduction 

1.1 Background 

1.2 Design Goals 

1.3 History 

2 Overview of MPI-IO 

2.1 Data Partitioning in MPI-IO 

2.2 MPI-IO Data Access Functions 

2.3 Offsets and File Pointers 

2.4 End of File 

2.5 Current Proposal and Future Extensions

3 Interface Definitions and Conventions 

3.1 Independent vs. Collective 

3.2 Blocking vs. Nonblocking 

3.3 Etype, Filetype, Buftype, and Offset Relation 

3.4 Displacement and offset types 

3.5 Return Code and Status 

3.6 Interrupts 

4 File Control 

4.1 Opening a File (Collective) 

4.2 Closing a file (Collective) 

4.3 File Control (Independent/Collective) 

4.4 Deleting a file (Independent) 

4.5 Resizing a file (Collective) 

4.6 File Sync (Collective) 

5 Independent I/O 

5.1 MPIO_Read 

5.2 MPIO_Write 

5.3 MPIO_Iread

5.4 MPIO_Iwrite

6 Collective I/O

6.1 MPIO_Read_all

6.2 MPIO_Write_all

6.3 MPIO_Iread_all

6.4 MPIO_Iwrite_all

7 File pointers 

7.1 Introduction 

7.2 Shared File Pointer I/O Functions 

7.3 Individual File Pointer Blocking I/O Functions 





7.4 Individual File Pointer Nonblocking I/O Functions

7.5 File Pointer Manipulation Functions

8 Filetype Constructors

8.1 Introduction

8.2 Broadcast-Read and Write-Reduce Constructors

8.3 Scatter/Gather Type Constructors

8.4 HPF Filetype Constructors

9 Error Handling

9.1 MPIO_Errhandler_create (independent)

9.2 MPIO_Errhandler_set (independent)

9.3 MPIO_Errhandler_get (independent)

Bibliography

A MPIO_Open File hints

B System Support for File Pointers

B.1 Interface Style

B.2 File Pointer Update

B.3 Collective Operations with Shared File Pointers

C Unix Read/Write Atomic Semantics

D Filetype Constructors: Sample Implementations and Examples

D.1 Support Routines

D.2 Sample Filetype Constructor Implementations

D.3 Example: Row block distribution of A[100, 100]

D.4 Example: Column block distribution of A[100, 100]

D.5 Example: Transposing a 2-D Matrix in a Row-Cyclic Distribution

E Justifying Design Decisions




Draft Document of the MPI-IO Interface, January 30, 1995 




1 Introduction 

Thanks to MPI [9], writing portable message passing parallel programs is almost a reality. 
One of the remaining problems is file I/O. Although parallel file systems support similar 
interfaces, the lack of a standard makes developing a truly portable program impossible. 
Further, the closest thing to a standard, the UNIX file interface, is ill-suited to parallel 
computing. 

Working together, IBM Research and NASA Ames have drafted MPI-IO, a proposal to 
address the portable parallel I/O problem. In a nutshell, this proposal is based on the idea 
that I/O can be modeled as message passing: writing to a file is like sending a message, and 
reading from a file is like receiving a message. MPI-IO intends to leverage the relatively 
wide acceptance of the MPI interface in order to create a similar I/O interface. 

The above approach can be materialized in different ways. The current proposal repre- 
sents the result of extensive discussions (and arguments), but is by no means finished. Many 
changes can be expected as additional participants join the effort to define an interface for 
portable I/O. 

This document is organized as follows. The remainder of this section includes a dis- 
cussion of some issues that have shaped the style of the interface. Section 2 presents an 
overview of MPI-IO as it is currently defined. It specifies what the interface currently sup- 
ports and states what would need to be added to the current proposal to make the interface 
more complete and robust. The next seven sections contain the interface definition itself. 
Section 3 presents definitions and conventions. Section 4 contains functions for file control, 
most notably open. Section 5 includes functions for independent I/O, both blocking and 
nonblocking. Section 6 includes functions for collective I/O, both blocking and nonblock- 
ing. Section 7 presents functions to support system-maintained file pointers, and shared 
file pointers. Section 8 presents constructors that can be used to define useful filetypes (the 
role of filetypes is explained in Section 2 below). Section 9 presents how the error handling 
mechanism of MPI is supported by the MPI-IO interface. All this is followed by a set of 
appendices, which contain information about issues that have not been totally resolved yet, 
and about design considerations. The reader can find there the motivation behind some 
of our design choices. More information on this would definitely be welcome and will be 
included in a further release of this document. The first appendix contains a description of 
MPI-IO’s “hints” structure which is used when opening a file. Appendix B is a discussion of
various issues in the support for file pointers. Appendix C explains what we mean in talking 
about atomic access. Appendix D provides detailed examples of filetype constructors, and 
Appendix E contains a collection of arguments for and against various design decisions. 

1.1 Background 

The main deficiency of Unix I/O in the context of parallel computing is that Unix is designed 
first and foremost for an environment where files are not shared by multiple processes at 
once (with the exception of pipes and their restricted access possibilities). In a parallel 
environment, simultaneous access by multiple processes is the rule rather than the exception. 
Moreover, parallel processes often access the file in an interleaved manner, where each 
process accesses a fragmented subset of the file, while other processes access the parts that 
the first process does not access [8]. Unix file operations provide no support for such access, 
and in particular, do not allow access to multiple non-contiguous parts of the file in a single 
operation. 





Parallel file systems and programming environments have typically solved this problem 
by introducing file modes. The different modes specify the semantics of simultaneous opera- 
tions by multiple processes. Once a mode is defined, conventional read and write operations 
are used to access the data, and their semantics are determined by the mode. The most
common modes are [10, 7, 6, 1]: 


mode               description                        examples
-----------------  ---------------------------------  --------------------------
broadcast/reduce   all processes collectively         Express singl
                   access the same data               PFS global mode
                                                      CMMD sync-broadcast

scatter/gather     all processes collectively         Express multi
                   access a sequence of data          CFS modes 2 and 3
                   blocks, in rank order              PFS sync & record
                                                      CMMD sync-sequential

shared offset      processes operate independently    CFS mode 1
                   but share a common file pointer    PFS log mode

independent        allows programmer complete         Express async
                   freedom                            CFS mode 0
                                                      PFS Unix mode
                                                      CMMD local & independent


The common denominator of those modes that actually attempt to capture useful I/O 
patterns and help the programmer is that they define how data is partitioned among the 
processes. Some systems do this explicitly without using modes, and allow the programmer 
to define the partitioning directly. Examples include Vesta [3] and the nCUBE system 
software [4]. Recent studies show that various simple partitioning schemes do indeed account 
for most of observed parallel I/O patterns [8]. MPI-IO also has the goal of supporting such 
common patterns. 

1.2 Design Goals 

The goal of the MPI-IO interface is to provide a widely used standard for describing parallel 
I/O operations within an MPI message-passing application. The interface should establish 
a flexible, portable, and efficient standard for describing independent and collective file I/O 
operations by processes in a parallel application. The MPI-IO interface is intended to be 
submitted as a proposal for an extension of the MPI standard in support of parallel file I/O. 
The need for such an extension arises from three main reasons. First, the MPI standard does 
not cover file I/O. Second, not all parallel machines support the same parallel or concurrent 
file system interface. Finally, the traditional Unix file system interface is ill-suited to parallel 
computing. 

The MPI-IO interface was designed with the following goals:

1. It was targeted primarily for scientific applications, though it may be useful for other 
applications as well. 

2. MPI-IO favors common usage patterns over obscure ones. It tries to support 90% of
parallel programs easily at the expense of making things more difficult in the other
10%.

3. MPI-IO features are intended to correspond to real world requirements, not just arbitrary
usage patterns. New features were only added when they were useful for some
real world need.

4. MPI-IO allows the programmer to specify high level information about I/O to the 
system rather than low-level system dependent information. 

5. The design favors performance over functionality. 

The following, however, were not goals of MPI-IO: 

1. Support for message passing environments other than MPI. 

2. Compatibility with the UNIX file interface. 

3. Support for transaction processing. 

4. Support for FORTRAN record oriented I/O. 

1.3 History 

This work is an outgrowth of the original proposal from IBM [11], but it is significantly 
different. The main difference is the use of file types to express partitioning in an MPI- 
like style, rather than using special Vesta functions. In addition, file types are now used to 
express various access patterns such as scatter/gather, rather than having explicit functions
for the different patterns. 

Version 0.2 is the one presented at the Supercomputing ’94 birds-of-a-feather session, 
with new functions and constants prefixed by “MPIO_” rather than “MPI_” to emphasize 
the fact that they are not part of the MPI standard. 

Version 0.3 accounts for comments received as of December 31, 1994. It states more 
precisely what the current MPI-IO proposal covers and what it does not address (yet) (see 
Section 2.5). Error handling is now supported (see Section 9). Permission modes are not 
specified any longer when opening a file (see Section 4.1). Users can now inquire about the current
size of a file (see Section 4.3). The semantics for updating file pointers has been changed 
and is identical for both individual and shared file pointers, and for both blocking and 
nonblocking operations (see Section 7). 

2 Overview of MPI-IO 

Emphasis has been placed on keeping MPI-IO as MPI-friendly as possible. When opening a file,
a communicator is specified to determine which group of tasks can get access to the file in
subsequent I/O operations. Accesses to a file can be independent (no coordination between
tasks takes place) or collective (each task of the group associated with the communicator
must participate in the collective access). MPI derived datatypes are used for expressing the
data layout in the file as well as the partitioning of the file data among the communicator
tasks. In addition, each read/write access operates on a number of MPI objects which can
be of any MPI basic or derived datatype.

2.1 Data Partitioning in MPI-IO

Instead of defining file access modes in MPI-IO to express the common patterns for access- 
ing a shared file (broadcast, reduction, scatter, gather), we chose another approach which 
consists of expressing the data partitioning via MPI derived datatypes. Compared to a lim- 
ited set of pre-defined access patterns, this approach has the advantage of added flexibility 
and expressiveness. 

MPI derived datatypes are used in MPI to describe how data is laid out in the user’s 
buffer. We extend this use to describe how the data is laid out in the file as well. Thus we 
distinguish between two (potentially different) derived datatypes that are used: the filetype, 
which describes the layout in the file, and the buftype, which describes the layout in the 
user’s buffer. In addition, both filetype and buftype are derived from a third MPI datatype, 
referred to as the elementary datatype etype. The purpose of the elementary datatype is to 
ensure consistency between the type signatures of filetype and buftype. Offsets for accessing 
data within the file are expressed as an integral number of etype items. 

The filetype defines a data pattern that is replicated throughout the file (or part of the 
file — see the concept of displacement below) to tile the file data. It should be noted that 
MPI derived datatypes consist of fields of data that are located at specified offsets. This 
can leave “holes” between the fields, that do not contain any data. In the context of tiling 
the file with the filetype, the task can only access the file data that matches items in the 
filetype. It cannot access file data that falls under holes (see Figure 1). 
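
The tiling rule can be illustrated with a small stand-alone C sketch that does not use MPI at all. It models a filetype as an array of flags, one per etype slot of the tile (1 for a data item, 0 for a hole), and answers whether a given etype position in the tiled region is accessible to the task. The type filetype_mask and the functions is_accessible and count_accessible are hypothetical names introduced only for this illustration; they are not part of the MPI-IO proposal.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a filetype: one flag per etype slot in a tile,
 * 1 = data item visible to this task, 0 = hole. Not an MPI-IO type. */
typedef struct {
    const int *mask;   /* flags for one tile */
    size_t     extent; /* number of etype slots in one tile */
} filetype_mask;

/* Return 1 if the etype slot at position pos (counted in etypes from the
 * start of the tiled region) matches a data item of the filetype. */
int is_accessible(const filetype_mask *ft, size_t pos)
{
    assert(ft != NULL && ft->extent > 0);
    return ft->mask[pos % ft->extent] != 0;
}

/* Count the accessible etype items among the first n slots of the region. */
size_t count_accessible(const filetype_mask *ft, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (size_t)is_accessible(ft, i);
    return count;
}
```

With a tile of two data items followed by four holes, positions 0, 1, 6 and 7 are accessible and every other position among the first twelve falls under a hole.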

[Figure: an etype, a filetype built from etypes and holes, and a file tiled with the filetype; only the data matching items of the filetype is accessible.]

Figure 1: Tiling a file using a filetype

Data which resides in holes can be accessed by other tasks which use complementary 
filetypes (see Figure 2). Thus, file data can be distributed among parallel tasks in dis- 
joint chunks. MPI-IO provides filetype constructors to help the user create complementary 
filetypes for common distribution patterns, such as broadcast/reduce, scatter/gather, and 
HPF distributions (see Section 8). 
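
As a rough illustration of what "complementary" means here, the following C sketch checks whether a set of filetypes, each modeled as a flag array over the same tile, assigns every slot of the tile to exactly one task. The flag-array model and the function name is_partition are assumptions made for this example; MPI-IO itself expresses filetypes as MPI derived datatypes, and it also permits overlapping filetypes, which this check would reject.

```c
#include <assert.h>
#include <stddef.h>

/* masks holds ntasks flag arrays of length extent, stored back to back:
 * masks[t * extent + slot] is nonzero if task t sees that slot.
 * Return 1 if every slot of the tile belongs to exactly one task. */
int is_partition(const int *masks, size_t ntasks, size_t extent)
{
    assert(masks != NULL);
    for (size_t slot = 0; slot < extent; slot++) {
        size_t owners = 0;
        for (size_t t = 0; t < ntasks; t++)
            owners += (masks[t * extent + slot] != 0);
        if (owners != 1)
            return 0;   /* slot is unowned or shared */
    }
    return 1;
}
```

Three filetypes that each expose a different pair of slots in a six-slot tile pass the check; making two of them expose the same slot makes it fail.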

[Figure: an etype, the complementary filetypes of processes 1, 2, and 3, and a file tiled with these filetypes so that each process accesses a disjoint part of the data.]

Figure 2: Partitioning a file among parallel tasks

In order to better illustrate these concepts, let us consider a 2-D matrix, stored in row
major order in a file, that is to be transposed and partitioned among a group of three tasks
(see Figure 3). The matrix is to be distributed among the parallel tasks in a row cyclic 
manner. Each task wants to store in its own memory the transposed portion of the matrix 
which is assigned to it. Using appropriate filetypes and buftypes allows the user to perform 
that task very easily. In addition, the elementary datatype allows one to have a very generic 
code that applies to any type of 2-D matrix. The corresponding MPI-IO code example is 
given in Appendix D. 
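
To make the intended data movement concrete, here is a plain C sketch that performs it on an in-memory array standing in for the file. It is not MPI-IO code: in MPI-IO the filetype would select the rows of each task and the buftype would place them transposed in memory, whereas this sketch uses explicit loops. The function names are hypothetical, and tasks are numbered from 0 here.

```c
#include <assert.h>
#include <stddef.h>

/* Number of rows of an nrows-row matrix assigned to task rank (0-based)
 * when rows are dealt out cyclically among ntasks tasks. */
size_t local_row_count(size_t nrows, size_t ntasks, size_t rank)
{
    assert(ntasks > 0 && rank < ntasks);
    return (nrows > rank) ? (nrows - rank + ntasks - 1) / ntasks : 0;
}

/* Copy the rows owned by rank out of a row-major nrows x ncols matrix
 * (the "file") into buf, transposed: buf holds an ncols x local_rows
 * matrix in row-major order. */
void read_rows_transposed(const double *file, size_t nrows, size_t ncols,
                          size_t ntasks, size_t rank, double *buf)
{
    size_t local_rows = local_row_count(nrows, ntasks, rank);
    for (size_t k = 0; k < local_rows; k++) {
        size_t row = rank + k * ntasks;   /* global index of k-th local row */
        for (size_t c = 0; c < ncols; c++)
            buf[c * local_rows + k] = file[row * ncols + c];
    }
}
```

For a 4 x 2 matrix and three tasks, task 0 owns rows 0 and 3 and ends up with a 2 x 2 buffer whose rows are the two columns of those file rows.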



[Figure: implementation using etype, filetypes, and buftypes. Shown are the etype, the filetypes of processes 1, 2, and 3, the actual layout in the file, and the buftype used by all processes.]

Figure 3: Transposing and partitioning a 2-D matrix

Note that using MPI derived datatypes leads to the possibility of very flexible patterns. 
For example, the filetypes need not distribute the data in rank order. In addition, there can 
be overlaps between the data items that are accessed by different processes. The extreme 
case of full overlap is the broadcast/reduce pattern. 

Using the filetype allows a certain access pattern to be established. But it is conceivable 
that a single pattern would not be suitable for the whole file. The MPI-IO solution is to 
define a displacement from the beginning of the file, and have the access pattern start from 
that displacement. Thus if a file has two segments that need to be accessed in different 
patterns, the displacement for the second pattern will skip over the whole first segment. 
This mechanism is also particularly useful for handling files with some header information at 
the beginning (see Figure 4). Use of file headers could allow the support of heterogeneous
environments by storing a “standard” codification of the data representations and data
types of the file data. 
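
Since the displacement is given in bytes while offsets are counted in etype items, the byte position of an item follows from simple arithmetic. The C sketch below spells this out; byte_position is a hypothetical helper introduced for illustration, not a function of the proposal.

```c
#include <assert.h>
#include <stddef.h>

/* Byte position in the file of the item at absolute offset abs_offset
 * (counted in etypes) for a tiling that starts disp bytes into the file. */
size_t byte_position(size_t disp, size_t etype_size, size_t abs_offset)
{
    assert(etype_size > 0);
    return disp + abs_offset * etype_size;
}
```

For example, with a 128-byte header and an 8-byte etype, the item at absolute offset 10 starts at byte 208. A second segment of the file would be accessed by reopening the view with a displacement that skips the whole first segment.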


[Figure: file structure consisting of a header followed by a first tiling and a second tiling; the first and second displacements mark where each tiling starts.]

Figure 4: Displacements


2.2 MPI-IO Data Access Functions 

As noted above, we have elected not to define specific calls for the different access patterns. 
However, there are different calls for the different synchronization behaviors which are 
desired, and for different ways to specify the offset in the file. The following table summarizes 
these calls: 


offset         synchronization   independent           collective
-------------  ----------------  --------------------  ----------------------
explicit       blocking          MPIO_Read             MPIO_Read_all
offset         (synchronous)     MPIO_Write            MPIO_Write_all

               nonblocking       MPIO_Iread            MPIO_Iread_all
               (asynchronous)    MPIO_Iwrite           MPIO_Iwrite_all

independent    blocking          MPIO_Read_next        MPIO_Read_next_all
file pointer   (synchronous)     MPIO_Write_next       MPIO_Write_next_all

               nonblocking       MPIO_Iread_next       MPIO_Iread_next_all
               (asynchronous)    MPIO_Iwrite_next      MPIO_Iwrite_next_all

shared         blocking          MPIO_Read_shared      (none)
file pointer   (synchronous)     MPIO_Write_shared

               nonblocking       MPIO_Iread_shared     (none)
               (asynchronous)    MPIO_Iwrite_shared

The independent calls with explicit offsets are described in Section 5, and the col- 
lective ones in Section 6. Independent calls do not imply any coordination among the 
calling processes. On the other hand, collective calls imply that all tasks belonging to the 
communicator associated with the opened file must participate. However, as in MPI, no 
synchronization pattern between those tasks is enforced by the MPI-IO definition. Any re- 
quired synchronization may depend upon a specific implementation. Collective calls can be 
used to achieve certain semantics, as in a scatter-gather operation, but they are also useful 
to advise the system of a set of independent accesses that may be optimized if combined. 

When several independent data accesses involve multiple overlapping data blocks, it
may be desirable to guarantee the atomicity of each access, as provided by Unix (see
Appendix C). In this case, it is possible to enable the MPIO_CAUTIOUS access mode for the file.
Note that the cautious mode does not guarantee atomicity of accesses between two different
MPI applications accessing the same file data, even if they both specify the MPIO_CAUTIOUS
mode. Its effect is limited to the confines of the MPI_COMM_WORLD communicator group
of the processes that opened the file, typically all the processes in the job. The default
access mode, referred to as MPIO_RECKLESS mode in MPI-IO, does not guarantee atomicity
between concurrent accesses of the same file data by two parallel tasks of the same MPI
application.


2.3 Offsets and File Pointers 


Part of the problem with the Unix interface when used by multiple processes is that there is 
no atomicity of seek and read/write operations. MPI-IO rectifies this problem by including 
an explicit offset argument in the first set of read and write calls. This offset can be absolute, 
which means that it ignores the file partitioning pattern, or relative, which means that only 
the data accessible by this process is counted, excluding the holes of the filetype associated 
with the task (see Figure 5). In both cases, offsets are expressed as an integral number of 
elementary datatype items. As absolute offsets can point to anywhere in the file, they can 
also point to an item that is inaccessible to this process. In this case, the offset will be
advanced automatically to the next accessible item. Therefore specifying any offset in a hole 
is functionally equivalent to specifying the offset of the first item after the hole. Absolute 
offsets may be easier to understand if accesses to arbitrary random locations are combined 
with partitioning the file among processes using filetypes. If such random accesses are not 
used, relative offsets are better. If the file is not partitioned, absolute and relative offsets 
are the same. 
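
The relation between the two kinds of offsets can be sketched in a few lines of C. As an assumption made for this illustration only, the filetype is modeled as a flag array over one tile (1 for a data item, 0 for a hole), and offsets are counted in etypes from the displacement. The functions rel_to_abs and normalize_abs are hypothetical names, not part of the interface.

```c
#include <assert.h>
#include <stddef.h>

/* Count the data (non-hole) slots in one tile of the mask. */
static size_t items_per_tile(const int *mask, size_t extent)
{
    size_t n = 0;
    for (size_t i = 0; i < extent; i++)
        n += (mask[i] != 0);
    return n;
}

/* Convert a relative offset (counting only accessible items) into the
 * absolute offset (counting holes too) of the same item. */
size_t rel_to_abs(const int *mask, size_t extent, size_t rel)
{
    size_t per_tile = items_per_tile(mask, extent);
    assert(per_tile > 0);
    size_t tile = rel / per_tile;   /* whole tiles skipped */
    size_t need = rel % per_tile;   /* accessible items to skip in the tile */
    for (size_t i = 0; i < extent; i++) {
        if (mask[i] != 0) {
            if (need == 0)
                return tile * extent + i;
            need--;
        }
    }
    return tile * extent;           /* not reached when per_tile > 0 */
}

/* Advance an absolute offset that falls in a hole to the next accessible
 * item, as the text describes; accessible offsets are returned unchanged. */
size_t normalize_abs(const int *mask, size_t extent, size_t abs_off)
{
    assert(items_per_tile(mask, extent) > 0);
    while (mask[abs_off % extent] == 0)
        abs_off++;
    return abs_off;
}
```

For a six-slot tile whose data items sit in slots 2 and 3, relative offsets 0, 1, 2 and 3 correspond to absolute offsets 2, 3, 8 and 9, and an absolute offset of 4 (inside a hole) is treated like absolute offset 8.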


[Figure: an etype, the filetypes of processes 1 and 2, and a file tiled from displacement disp, with example absolute (abs) and relative (rel) offsets marked for each process; an absolute offset that falls in a hole is equivalent to the absolute offset of the next accessible item.]

Figure 5: Absolute and relative offsets

It should be noted that the offset is a required argument in the explicit offset functions. 
Processes must maintain their offsets into all files by themselves. A separate set of functions, 
described in Section 7, provides the service of doing the next access where the previous one
left off. This is especially convenient for sequential access patterns (or partitioned-sequential 
patterns), which are very common in scientific computing [8]. Likewise, shared file pointers 
are also supported. This allows for the creation of a log file with no prior coordination 
among the processes, and also supports self-scheduled reading of data. However, there are 
no collective functions using shared offsets. This issue is discussed in Appendix B. 

2.4 End of File 

Unlike Unix files, the end of file is not absolute and identical for all processes accessing the 
file. It depends on the filetype used to access the file and is defined for a given process 
as the location of the byte following the last elementary datatype item accessible by that
process (excluding the holes). It may happen that data is located beyond the end of file for
a given process. This data is accessible only by other processes. 
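
A minimal C sketch of this per-process end of file follows. It assumes, for illustration only, that the filetype is modeled as a flag array over one tile and that positions are counted in etypes from the displacement rather than in bytes; task_eof is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Per-task end of file, in etype units from the displacement: one past the
 * last slot accessible to the task among the first file_items slots of the
 * file, or 0 if the task can access nothing. */
size_t task_eof(const int *mask, size_t extent, size_t file_items)
{
    assert(mask != NULL && extent > 0);
    for (size_t pos = file_items; pos > 0; pos--)
        if (mask[(pos - 1) % extent] != 0)
            return pos;
    return 0;
}
```

With a four-slot tile and a file of ten items, a task whose filetype exposes slots 0 and 1 sees its end of file at position 10, while a task whose filetype exposes slots 2 and 3 sees it at position 8; items 8 and 9 lie beyond the second task's end of file and are accessible only to the first.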


2.5 Current Proposal and Future Extensions 

The current proposal is not final and will evolve. Additions to it are definitely required to 
make the interface more complete and robust. 

Currently, the problem of heterogeneity of data representations across machine archi- 
tectures is not addressed. As stated above, filetypes are used to partition file data. Their 
purpose is not to ensure type consistency between file data accessed and user’s buffer data, 
nor are they intended to handle type conversion between file data and user’s buffer data. 
Therefore, file data can be currently considered as untyped data and has no data representa- 
tion associated with it. Research must be carried out in order to come up with a standard 
for storing persistent data in a machine independent format and for encoding in the file 
metadata type information of the file data (a file header could be used as a repository for 
these metadata). 

The error handling mechanism (see Section 9) is currently primitive, built on top of 
the MPI error handling mechanism. Further investigation is required in order to verify if 
this approach is appropriate and robust enough. 

No real support for accessing MPI-IO files from a non MPI-IO application is currently 
provided. Additional functions should enable the transfer of MPI-IO files to other file 
systems, as well as the importation of external files into the MPI-IO environment. However, 
the user can easily provide the import functionality for a given external file system (e.g., Unix)
by writing a single process program as follows: 


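    int          fd;             /* Unix descriptor of the source file */
    int          nread;          /* number of bytes returned by read() */
    char         buffer[4096];
    MPIO_File    fh;             /* MPI-IO handle of the target file */
    MPIO_Offset  offset;
    MPIO_Status  status;

    fd = open("source.file", O_RDONLY);
    /* Access modes are combined with the bitwise OR operator. */
    MPIO_Open(MPI_COMM_WORLD, "target.file", MPIO_CREATE | MPIO_WRONLY,
              MPIO_OFFSET_ZERO, MPI_BYTE, MPI_BYTE, MPIO_OFFSET_ABSOLUTE,
              NULL, &fh);
    offset = MPIO_OFFSET_ZERO;
    /* Copy block by block; stop at end of file or on a read error. */
    while ((nread = read(fd, buffer, 4096)) > 0) {
        MPIO_Write(fh, offset, buffer, MPI_BYTE, nread, &status);
        offset += nread;
    }
    close(fd);
    MPIO_Close(fh);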

A very similar program could be written to export an MPI-IO file. 

Let us also stress that nothing currently prevents the user from creating an MPI-IO file 
with a given number of processes and accessing it later with a different number of processes. 
This can be achieved by reopening the file with the appropriate filetypes.

The current proposal also lacks status information about MPI-IO files.
The user currently has no way of obtaining any information about an unopened MPI-IO
file, nor of learning the identity of the file owner, the dates and
times of the creation/last modification of the file, or the access permissions to the file.

These issues are only some of the main issues that need to be addressed by a real 
standard for parallel I/O. They will be incorporated into our proposal incrementally, as 
time permits. Our emphasis has been to first define the basic functions a standard for 
parallel I/O should provide to allow concurrent access to shared file data in a user-friendly 
and efficient way. This initial set composing the current interface is designed in such a way 
that extensions to it can be introduced easily and progressively. 

3 Interface Definitions and Conventions 

3.1 Independent vs. Collective 

An independent I/O request is a request which is executed individually by any of the processes
within a communicator group. A collective I/O request is a request which is executed
by all processes within a communicator group. The completion of an independent call only 
depends on the activity of the calling process. On the other hand, collective calls can (but 
are not required to) return as soon as their participation in the collective operation is com- 
pleted. The completion of the call, however, does not indicate that other processes have 
completed or even started the I/O operation. Thus, a collective call may, or may not, have 
the effect of synchronizing all calling processes. Collective calls may require that all pro- 
cesses, involved in the collective operation, pass the same value for an argument. We will 
indicate this with the “[SAME]” annotation in the function definition, as in the following
example: 


MPIO_CLOSE(fh)

IN fh [SAME] Valid file handle (handle) 

Advice to users. It is dangerous to rely on synchronization side-effects of the collective 
I/O operations for program correctness. However, a correct program must be aware 
of the fact that a synchronization may occur. ( End of advice to users.) 

Advice to implementors. While vendors may write optimized collective I/O oper- 
ations, all collective I/O operations can be written entirely using independent I/O 
operations. ( End of advice to implementors.) 

3.2 Blocking vs. Nonblocking 

One can improve performance by overlapping computation and I/O. A blocking I/O call 
will block until the I/O request is completed. A nonblocking I/O call only initiates an 
I/O operation, but does not wait for it to complete. A nonblocking call may return before 
the data has been read/written out of the user’s buffer. A separate request complete call
(MPI_Wait or MPI_Test) is needed to complete the I/O request, i.e., to certify that data has
been read/written out of the user’s buffer. With suitable hardware, the transfer of data
out/in the user’s buffer may proceed concurrently with computation.

Advice to users. The fact that a blocking or nonblocking I/O request completed does
not indicate that data has been stored on permanent storage. It only indicates that 
it is safe to access the user’s buffer. ( End of advice to users.) 

3.3 Etype, Filetype, Buftype, and Offset Relation 

The etype argument is the elementary datatype associated with a file. etype is used to
express the filetype, buftype and offset arguments. The filetype and buftype datatypes must
be directly constructed (i.e. derived datatype) from etype, or their type signatures must be
a multiple of the etype signature. Complete flexibility can be achieved by setting etype to
MPI_BYTE. The offset argument used in the read/write interfaces will be expressed in units
of the elementary datatype etype.

3.4 Displacement and offset types 

In FORTRAN, displacements and offsets are expressed as 64 bit integers. In case 64 bit 
integers are not supported by a specific machine, this does not preclude the use of MPI-IO, 
but restricts displacements to 2 billion bytes and offsets to 2 billion elementary datatype 
items (substituting INTEGER*8 variables with INTEGER*4 variables). In C, a new type, 
MPIO_Offset, is introduced and can be seen as a long long int, if supported, or as a long int
otherwise. 

3.5 Return Code and Status 

All the MPI-IO Fortran interfaces return a success or a failure code in the IERROR return
argument. All MPI-IO C functions also return a success or a failure code. The success
return code is MPI_SUCCESS. Failure return codes are implementation dependent.

If the end of file is reached during a read operation, the error MPIO_ERR_EOF is returned
(either by the blocking read operation or by the function MPI_Test or MPI_Wait applied to
the request returned by the nonblocking read operation). The user may write his/her own
error handler and associate it with the file handle (see Section 9) in order to process this
error.

The number of items actually read/written is stored in the status argument. The
MPI_Get_count or MPI_Get_elements MPI functions can be used to extract from status (opaque
object), the actual number of elements read/written either in etype, filetype or buftype units.

3.6 Interrupts 

Like MPI, MPI-IO should be interrupt safe. In other words, MPI-IO calls suspended by 
the occurrence of a signal should resume and complete after the signal is handled. In case 
the handling of the signal has an impact on the MPI-IO operation taking place, the MPI- 
IO implementation should behave appropriately for that situation and very likely an error 
message should be returned to the user and the relevant error handling take place (see 
Section 9).

4 File Control

4.1 Opening a File (Collective) 


MPIO_OPEN(comm, filename, amode, disp, etype, filetype, moffset, hints, fh)

  IN    comm      [SAME] Communicator that opens the file (handle)
  IN    filename  [SAME] Name of file to be opened (string)
  IN    amode     [SAME] File access mode (integer)
  IN    disp      Absolute displacement (nonnegative offset)
  IN    etype     [SAME] Elementary datatype (handle)
  IN    filetype  Filetype (handle)
  IN    moffset   Relative/Absolute offset flag (integer)
  IN    hints     Hints to the file system (array of integer)
  OUT   fh        Returned file handle (handle)


int MPIO_Open(MPI_Comm comm, char *filename, MPIO_Mode amode,
              MPIO_Offset disp, MPI_Datatype etype, MPI_Datatype filetype,
              MPIO_Offset_mode moffset, MPIO_Hints *hints, MPIO_File *fh)

MPIO_OPEN(COMM, FILENAME, AMODE, DISP, ETYPE, FILETYPE, MOFFSET, HINTS, FH,
          IERROR)
    CHARACTER FILENAME(*)
    INTEGER COMM, AMODE, ETYPE, FILETYPE, MOFFSET
    INTEGER HINTS(MPIO_HINTS_SIZE), FH, IERROR
    INTEGER*8 DISP

MPIO_Open opens the file identified by the file name filename, with the access mode
amode.

The following access modes are supported:

• MPIO_RDONLY - reading only

• MPIO_RDWR - reading and writing

• MPIO_WRONLY - writing only

• MPIO_CREATE - creating file

• MPIO_DELETE - deleting on close

These can be combined using the bitwise OR operator. Note that the Unix append mode is 
not supported. This mode can be emulated by requesting the current file size (see Section 
4.3) and seeking to the end of file before each write operation. 
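
As a rough illustration of the two points above, the following C sketch combines mode flags with bitwise OR and computes where an emulated append would write. The numeric flag values, the SKETCH_ names and both helper functions are assumptions made for this example only: the draft names the modes but does not assign values to them, and the append computation further assumes a filetype without holes.

```c
#include <assert.h>

/* Hypothetical bit values. The draft names the access modes but does not
 * assign numbers to them, so these constants are assumptions. */
enum {
    SKETCH_RDONLY = 1 << 0,  /* reading only */
    SKETCH_RDWR   = 1 << 1,  /* reading and writing */
    SKETCH_WRONLY = 1 << 2,  /* writing only */
    SKETCH_CREATE = 1 << 3,  /* creating file */
    SKETCH_DELETE = 1 << 4   /* deleting on close */
};

/* Nonzero if the combined mode word contains the given flag. */
static int mode_has(int amode, int flag)
{
    return (amode & flag) != 0;
}

/* Offset, in etype units, at which an emulated append would write.
 * Assumes a filetype without holes, so that etype offsets map directly
 * to bytes: byte position = disp + offset * etype_size. */
static long long append_offset(long long file_size, long long disp,
                               long long etype_size)
{
    assert(etype_size > 0 && file_size >= disp);
    return (file_size - disp) / etype_size;
}
```

For instance, a file of 1424 bytes opened with disp = 1024 and an 8-byte etype would place the next appended item at etype offset 50 under these assumptions.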

The disp displacement argument specifies the position (absolute offset in bytes from the 
beginning of the file), where the file is to be opened. This is used to skip headers, and when 
the file includes a sequence of data segments that are to be accessed in different patterns. 

The etype argument specifies the elementary datatype used to construct the filetype,
and also the buftype type used in the read/write. Offsets into the file are measured in units
of etype. The filetype argument describes what part of the data in the file is being accessed.
Conceptually, the file starting from disp is tiled by repeated copies of filetype, until the end.
If filetype has holes in it, then the data in the holes is inaccessible by this process. However,
the disp, etype and filetype arguments can be changed later to access a different part of the
file.

The argument moffset specifies how offset values must be interpreted. moffset can have
two values:

• MPIO_OFFSET_ABSOLUTE - as absolute offsets (count holes in filetype)

• MPIO_OFFSET_RELATIVE - as relative offsets (ignore holes in filetype)

Absolute offsets are interpreted relative to the full extent of the filetype. However, offsets 
that point to a hole in the filetype will actually access the data immediately following the 
hole. Relative offsets are interpreted relative to the accessible data only (ignoring the holes 
in the filetype). 
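
The difference between the two interpretations can be sketched in C as follows. This is an illustrative model only, not part of the proposed interface: the filetype is represented as an array of per-etype flags (data or hole), and the names filetype_sketch, resolve_absolute, resolve_relative, byte_position and the demo pattern are all hypothetical.

```c
#include <assert.h>

/* Illustrative model of a filetype: one flag per etype-sized slot. */
typedef struct {
    const int *accessible; /* accessible[i] != 0: slot i holds data; 0: hole */
    long long  extent;     /* number of etype slots in one filetype copy */
} filetype_sketch;

/* Absolute offset: count every slot, holes included. An offset that lands
 * in a hole is moved forward to the data immediately following the hole. */
static long long resolve_absolute(const filetype_sketch *ft, long long offset)
{
    long long step;
    assert(offset >= 0 && ft->extent > 0);
    for (step = 0; step < ft->extent; ++step) {
        long long pos = offset + step;
        if (ft->accessible[pos % ft->extent])
            return pos;
    }
    return -1; /* the filetype has no accessible slot at all */
}

/* Relative offset: count accessible slots only, across the tiled copies. */
static long long resolve_relative(const filetype_sketch *ft, long long offset)
{
    long long i, per_tile = 0, tile, want;
    assert(offset >= 0 && ft->extent > 0);
    for (i = 0; i < ft->extent; ++i)
        if (ft->accessible[i])
            ++per_tile;
    if (per_tile == 0)
        return -1;
    tile = offset / per_tile;   /* which copy of the filetype */
    want = offset % per_tile;   /* which accessible slot inside that copy */
    for (i = 0; i < ft->extent; ++i) {
        if (ft->accessible[i]) {
            if (want == 0)
                return tile * ft->extent + i;
            --want;
        }
    }
    return -1; /* not reached when per_tile > 0 */
}

/* Byte position in the file of a resolved slot. */
static long long byte_position(long long disp, long long slot,
                               long long etype_size)
{
    return disp + slot * etype_size;
}

/* Demo filetype: two data slots followed by a two-slot hole. */
static const int demo_pattern[4] = { 1, 1, 0, 0 };
static const filetype_sketch demo_ft = { demo_pattern, 4 };
```

With the demo pattern, absolute offset 2 falls in the hole and resolves to slot 4 (the first data slot of the next copy), while relative offset 2 names the third accessible item, which is also slot 4; relative offset 3 resolves to slot 5.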

The hints argument describes the user’s file access patterns and file system specifics
(see Appendix A).

Files are opened by default in the MPIO_RECKLESS read/write atomic semantics mode.
Each process may pass different values for the disp, filetype, moffset and hints arguments.
However, the filename, comm, amode and etype argument values must be the same.

The file handle returned, fh, can be subsequently used to access the file. 

Access permissions are not specified when opening a file. If the file is being created,
operating system defaults apply (e.g. the Unix umask).

Advice to users. Each process can open a file independently of other processes by
using the MPI_COMM_SELF communicator.

If two different MPI applications open the same file, the behavior and atomicity of
the file accesses are implementation dependent. The MPIO_CAUTIOUS mode enforces
read/write atomicity in the MPI_COMM_WORLD communicator group only. (End of
advice to users.)

4.2 Closing a file (Collective) 

MPIO_CLOSE(fh)

  IN   fh        [SAME] Valid file handle (handle)

int MPIO_Close(MPIO_File fh)

MPIO_CLOSE(FH, IERROR)
    INTEGER FH, IERROR

MPIO_Close closes the file associated with fh. If the file was opened with MPIO_DELETE,
the file is deleted. If there are other processes currently accessing the file, the status of the
file and the behavior of future accesses are implementation dependent. After closing, the
content of the file handle fh is destroyed. All future use of fh will cause an error.

Advice to implementors. If the file is to be deleted and is opened by other processes,
file data may still be accessible by these processes until they close the file or until they 
exit. (End of advice to implementors.) 

4.3 File Control (Independent/Collective) 


MPIO_FILE_CONTROL(fh, size, cmd, arg)

  IN      fh    [SAME] Valid file handle (handle)
  IN      size  [SAME] Number of commands passed (integer)
  IN      cmd   [SAME] Command arguments (array of integer)
  IN/OUT  arg   Arguments or return values to the command requests

int MPIO_File_control(MPIO_File fh, int size, int *cmd, void *arg)

MPIO_FILE_CONTROL(FH, SIZE, CMD, ARG, IERROR)
    INTEGER FH, SIZE, CMD(*), IERROR, ARG(*)

MPIO_File_control gets or sets information about the file associated with the file
handle fh. Multiple commands can be issued in one call, with the restriction that
collective and independent commands may not be mixed. The commands available are:

• (Independent)

— MPIO_GETCOMM: Get the communicator associated with the file.

— MPIO_GETNAME: Get the filename.

— MPIO_GETAMODE: Get the file access mode associated with the file.

— MPIO_GETDISP: Get the displacement.

— MPIO_GETETYPE: Get the elementary datatype.

— MPIO_GETFILETYPE: Get the filetype.

— MPIO_GETHINTS: Get the hints associated with the file.

— MPIO_GETATOM: Get the current read/write atomic semantics enforced mode.

— MPIO_GETINDIVIDUALPOINTER: Get the current offset of the individual file pointer
associated with the file (number of elementary datatype items within the file after
the displacement position).

— MPIO_GETSHAREDPOINTER: Get the current offset of the shared file pointer
associated with the file (number of elementary datatype items within the file after
the displacement position).

• (Collective)

— MPIO_SETAMODE: Set the file access mode using the arg argument. arg must be
a valid amode.

— MPIO_SETDISP: Set new displacement.

— MPIO_SETETYPE: Set the elementary datatype associated with the file.

— MPIO_SETFILETYPE: Set the filetype associated with the file.

— MPIO_SETATOM: Set the read/write atomic semantics enforced mode. arg can be
either MPIO_RECKLESS or MPIO_CAUTIOUS.

— MPIO_GETSIZE: Get the current file size.

For collective commands, all processes in the communicator group that opened the file
must issue the same command. In the cases of MPIO_SETAMODE and MPIO_SETATOM, the
arguments must also be identical.
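
The rule that a single call may not mix collective and independent commands could be checked along the following lines. This is only a sketch: the SK_ command codes, their numeric ordering and the classify_batch helper are hypothetical, since the draft does not assign values to the command names or prescribe how an implementation validates them.

```c
#include <assert.h>

/* Hypothetical command codes, listed in the same order as in the text.
 * Everything from SK_SETAMODE onward is a collective command. */
enum sketch_cmd {
    SK_GETCOMM, SK_GETNAME, SK_GETAMODE, SK_GETDISP, SK_GETETYPE,
    SK_GETFILETYPE, SK_GETHINTS, SK_GETATOM,
    SK_GETINDIVIDUALPOINTER, SK_GETSHAREDPOINTER,
    SK_SETAMODE, SK_SETDISP, SK_SETETYPE, SK_SETFILETYPE,
    SK_SETATOM, SK_GETSIZE,
    SK_CMD_COUNT
};

enum { BATCH_INVALID = -1, BATCH_INDEPENDENT = 0, BATCH_COLLECTIVE = 1 };

static int is_collective(int cmd)
{
    return cmd >= SK_SETAMODE && cmd < SK_CMD_COUNT;
}

/* Classifies a command array as all-independent or all-collective.
 * Returns BATCH_INVALID for an empty array, an unknown code, or a mix. */
static int classify_batch(const int *cmd, int size)
{
    int i, first;
    if (size <= 0)
        return BATCH_INVALID;
    for (i = 0; i < size; ++i)
        if (cmd[i] < 0 || cmd[i] >= SK_CMD_COUNT)
            return BATCH_INVALID;
    first = is_collective(cmd[0]);
    for (i = 1; i < size; ++i)
        if (is_collective(cmd[i]) != first)
            return BATCH_INVALID;
    return first ? BATCH_COLLECTIVE : BATCH_INDEPENDENT;
}
```

A batch such as {SK_GETDISP, SK_GETETYPE} is classified as independent, while {SK_GETDISP, SK_SETDISP} is rejected because it mixes the two kinds.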

4.4 Deleting a file (Independent) 


MPIO_DELETE(filename)

  IN   filename  Name of the file to be deleted (string)

int MPIO_Delete(char *filename)

MPIO_DELETE(FILENAME, IERROR)
    CHARACTER FILENAME(*)
    INTEGER IERROR

MPIO_Delete deletes a file. If the file exists it is removed. If there are other processes 
currently accessing the file, the status of the file and the behavior of future accesses are 
implementation dependent. If the file does not exist, MPIO_Delete returns a warning error 
code. 


Advice to implementors. If the file to be deleted is opened by other processes, file 
data may still be accessible by these processes until they close the file or until they 
exit. (End of advice to implementors.)

4.5 Resizing a file (Collective) 


MPIO_RESIZE(fh, disp)

  IN   fh        [SAME] Valid file handle (handle)
  IN   disp      [SAME] Displacement which the file is to be truncated
                 at or expanded to (nonnegative offset)

int MPIO_Resize(MPIO_File fh, MPIO_Offset disp)

MPIO_RESIZE(FH, DISP, IERROR)
    INTEGER FH, IERROR
    INTEGER*8 DISP

MPIO_Resize resizes the file associated with the file handle fh. If disp is smaller than
the current file size, the file is truncated at the position defined by disp (from the beginning
of the file and measured in bytes). File blocks located beyond that position are deallocated.
If disp is larger than the current file size, additional file blocks are allocated and the file
size becomes disp. All processes in the communicator group must call MPIO_Resize with
the same displacement.

4.6 File Sync (Collective)

MPIO_FILE_SYNC(fh)

  IN   fh        [SAME] Valid file handle (handle)

int MPIO_File_sync(MPIO_File fh)

MPIO_FILE_SYNC(FH, IERROR)
    INTEGER FH, IERROR

MPIO_File_sync causes the contents of the file referenced by fh to be flushed to permanent
storage. All processes in the communicator group associated with the file handle fh
must call MPIO_File_sync. The MPIO_File_sync call returns after all processes in the
communicator group have flushed to permanent storage the data they have been accessing since
they opened the file.

Advice to users. MPIO_File_sync guarantees that all completed I/O requests have
been flushed to permanent storage. Pending nonblocking I/O requests that have not
completed are not guaranteed to be flushed. (End of advice to users.)

5 Independent I/O

5.1 MPIO_Read

MPIO_READ(fh, offset, buff, buftype, bufcount, status)

  IN   fh        Valid file handle (handle)
  IN   offset    File offset (nonnegative offset)
  OUT  buff      Initial address of the user’s buffer (integer)
  IN   buftype   User’s buffer datatype (handle)
  IN   bufcount  Number of buftype elements (integer)
  OUT  status    Status information (Status)

int MPIO_Read(MPIO_File fh, MPIO_Offset offset, void *buff,
              MPI_Datatype buftype, int bufcount, MPI_Status *status)

MPIO_READ(FH, OFFSET, BUFF, BUFTYPE, BUFCOUNT, STATUS, IERROR)
    <type> BUFF(*)
    INTEGER FH, BUFTYPE, BUFCOUNT, STATUS(MPI_STATUS_SIZE), IERROR
    INTEGER*8 OFFSET

MPIO_Read attempts to read from the file associated with fh (at the offset position)
a total number of bufcount data items having buftype datatype into the user’s buffer buff.
The data is taken out of those parts of the file specified by filetype. MPIO_Read stores the
number of buftype elements actually read in status.

5.2 MPIO_Write


MPIO_WRITE(fh, offset, buff, buftype, bufcount, status)

  IN   fh        Valid file handle (handle)
  IN   offset    File offset (nonnegative offset)
  IN   buff      Initial address of the user’s buffer (integer)
  IN   buftype   User’s buffer datatype (handle)
  IN   bufcount  Number of buftype elements (integer)
  OUT  status    Status information (Status)

int MPIO_Write(MPIO_File fh, MPIO_Offset offset, void *buff,
               MPI_Datatype buftype, int bufcount, MPI_Status *status)

MPIO_WRITE(FH, OFFSET, BUFF, BUFTYPE, BUFCOUNT, STATUS, IERROR)
    <type> BUFF(*)
    INTEGER FH, BUFTYPE, BUFCOUNT, STATUS(MPI_STATUS_SIZE), IERROR
    INTEGER*8 OFFSET

MPIO_Write attempts to write into the file associated with fh (at the offset position)
a total number of bufcount data items having buftype datatype from the user’s buffer buff.
The data is written into those parts of the file specified by filetype. MPIO_Write stores the
number of buftype elements actually written in status.

5.3 MPIO_Iread


MPIO_IREAD(fh, offset, buff, buftype, bufcount, request)

  IN   fh        Valid file handle (handle)
  IN   offset    File offset (nonnegative offset)
  OUT  buff      Initial address of the user’s buffer (integer)
  IN   buftype   User’s buffer datatype (handle)
  IN   bufcount  Number of buftype elements (nonnegative integer)
  OUT  request   Read request handle (handle)

int MPIO_Iread(MPIO_File fh, MPIO_Offset offset, void *buff,
               MPI_Datatype buftype, int bufcount, MPI_Request *request)

MPIO_IREAD(FH, OFFSET, BUFF, BUFTYPE, BUFCOUNT, REQUEST, IERROR)
    <type> BUFF(*)
    INTEGER FH, BUFTYPE, BUFCOUNT, REQUEST, IERROR
    INTEGER*8 OFFSET

MPIO_Iread is a nonblocking version of the MPIO_Read interface. MPIO_Iread associates
a request handle request with the I/O request. The request handle can be used later to query
the status of the read request, using the MPI function MPI_Test, or wait for its completion,
using the function MPI_Wait.

The nonblocking read call indicates that the system can start to read data into the
supplied buffer. The user should not access any part of the receiving buffer after a
nonblocking read is posted, until the read completes (as indicated by MPI_Test or MPI_Wait).
MPIO_Iread attempts to read from the file associated with fh (at the offset position) a total
number of bufcount data items having buftype type into the user’s buffer buff. The number
of buftype elements actually read can be extracted from the MPI_Test or MPI_Wait return
status.

5.4 MPIO_Iwrite


MPIO_IWRITE(fh, offset, buff, buftype, bufcount, request)

  IN   fh        Valid file handle (handle)
  IN   offset    File offset (nonnegative offset)
  IN   buff      Initial address of the user’s buffer (integer)
  IN   buftype   User’s buffer datatype (handle)
  IN   bufcount  Number of buftype elements (nonnegative integer)
  OUT  request   Write request handle (handle)

int MPIO_Iwrite(MPIO_File fh, MPIO_Offset offset, void *buff,
                MPI_Datatype buftype, int bufcount, MPI_Request *request)

MPIO_IWRITE(FH, OFFSET, BUFF, BUFTYPE, BUFCOUNT, REQUEST, IERROR)
    <type> BUFF(*)
    INTEGER FH, BUFTYPE, BUFCOUNT, REQUEST, IERROR
    INTEGER*8 OFFSET

MPIO_Iwrite is a nonblocking version of the MPIO_Write interface. MPIO_Iwrite associates
a request handle request with the I/O request. The request handle can be used later
to query the status of the write request, using the MPI function MPI_Test, or wait for its
completion, using MPI_Wait.

The nonblocking write call indicates that the system can start to write data from the
supplied buffer. The user should not access any part of the buffer after the nonblocking write
is called, until the write completes (as indicated by MPI_Test or MPI_Wait). MPIO_Iwrite
attempts to write into the file associated with fh (at the offset position), a total number of
bufcount data items having buftype type from the user’s buffer buff. The number of buftype
elements actually written can be extracted from the MPI_Test or MPI_Wait return status.

6 Collective I/O

6.1 MPIO_Read_all 

MPIO_READ_ALL(fh, offset, buff, buftype, bufcount, status)

  IN   fh        [SAME] Valid file handle (handle)
  IN   offset    File offset (nonnegative offset)
  OUT  buff      Initial address of the user’s buffer (integer)
  IN   buftype   User’s buffer datatype (handle)
  IN   bufcount  Number of buftype elements (nonnegative integer)
  OUT  status    Status information (Status)

int MPIO_Read_all(MPIO_File fh, MPIO_Offset offset, void *buff,
                  MPI_Datatype buftype, int bufcount, MPI_Status *status)

MPIO_READ_ALL(FH, OFFSET, BUFF, BUFTYPE, BUFCOUNT, STATUS, IERROR)
    <type> BUFF(*)
    INTEGER FH, BUFTYPE, BUFCOUNT, STATUS(MPI_STATUS_SIZE), IERROR
    INTEGER*8 OFFSET

MPIO_Read_all is a collective version of the blocking MPIO_Read interface. All processes
in the communicator group associated with the file handle fh must call MPIO_Read_all. Each
process may pass different argument values for the offset, buftype, and bufcount arguments.
For each process, MPIO_Read_all attempts to read, from the file associated with fh (at the
offset position), a total number of bufcount data items having buftype type into the user’s
buffer buff. MPIO_Read_all stores the number of buftype elements actually read in status.

6.2 MPIO_Write_all 


MPIO_WRITE_ALL(fh, offset, buff, buftype, bufcount, status)

  IN   fh        [SAME] Valid file handle (handle)
  IN   offset    File offset (nonnegative offset)
  IN   buff      Initial address of the user’s buffer (integer)
  IN   buftype   User’s buffer datatype (handle)
  IN   bufcount  Number of buftype elements (nonnegative integer)
  OUT  status    Status information (Status)

int MPIO_Write_all(MPIO_File fh, MPIO_Offset offset, void *buff,
                   MPI_Datatype buftype, int bufcount, MPI_Status *status)

MPIO_WRITE_ALL(FH, OFFSET, BUFF, BUFTYPE, BUFCOUNT, STATUS, IERROR)
    <type> BUFF(*)
    INTEGER FH, BUFTYPE, BUFCOUNT, STATUS(MPI_STATUS_SIZE), IERROR
    INTEGER*8 OFFSET

MPIO_Write_all is a collective version of the blocking MPIO_Write interface. All processes
in the communicator group associated with the file handle fh must call MPIO_Write_all.
Each process may pass different argument values for the offset, buftype and bufcount
arguments. For each process, MPIO_Write_all attempts to write, into the file associated with
fh (at the offset position), a total number of bufcount data items having buftype type.
MPIO_Write_all stores the number of buftype elements actually written in status.

6.3 MPIO_Iread_all


MPIO_IREAD_ALL(fh, offset, buff, buftype, bufcount, request)

  IN   fh        [SAME] Valid file handle (handle)
  IN   offset    File offset (nonnegative offset)
  OUT  buff      Initial address of the user’s buffer (integer)
  IN   buftype   User’s buffer datatype (handle)
  IN   bufcount  Number of buftype elements (nonnegative integer)
  OUT  request   Read request handle (handle)

int MPIO_Iread_all(MPIO_File fh, MPIO_Offset offset, void *buff,
                   MPI_Datatype buftype, int bufcount, MPI_Request *request)

MPIO_IREAD_ALL(FH, OFFSET, BUFF, BUFTYPE, BUFCOUNT, REQUEST, IERROR)
    <type> BUFF(*)
    INTEGER FH, BUFTYPE, BUFCOUNT, REQUEST, IERROR
    INTEGER*8 OFFSET

MPIO_Iread_all is a collective version of the nonblocking MPIO_Iread interface. All processes
in the communicator group associated with the file handle fh must call MPIO_Iread_all.
Each process may pass different argument values for the offset, buftype and bufcount
arguments. For each process in the group, MPIO_Iread_all attempts to read, from the file
associated with fh (at the offset position), a total number of bufcount data items having
buftype type into the user’s buffer buff. MPIO_Iread_all associates an individual request
handle request to the I/O request for each process. The request handle can be used later by a
process to query the status of its individual read request or wait for its completion. On each
process, MPIO_Iread_all completes when the individual request has completed (i.e. a process
does not have to wait for all other processes to complete). The user should not access any
part of the receiving buffer after a nonblocking read is called, until the read completes.

6.4 MPIO_Iwrite_all

MPIO_IWRITE_ALL(fh, offset, buff, buftype, bufcount, request)

  IN   fh        [SAME] Valid file handle (handle)
  IN   offset    File offset (nonnegative offset)
  IN   buff      Initial address of the user’s buffer (integer)
  IN   buftype   User’s buffer datatype (handle)
  IN   bufcount  Number of buftype elements (nonnegative integer)
  OUT  request   Write request handle (handle)

int MPIO_Iwrite_all(MPIO_File fh, MPIO_Offset offset, void *buff,
                    MPI_Datatype buftype, int bufcount, MPI_Request *request)

MPIO_IWRITE_ALL(FH, OFFSET, BUFF, BUFTYPE, BUFCOUNT, REQUEST, IERROR)
    <type> BUFF(*)
    INTEGER FH, BUFTYPE, BUFCOUNT, REQUEST, IERROR
    INTEGER*8 OFFSET

MPIO_Iwrite_all is a collective version of the nonblocking MPIO_Iwrite interface. All processes
in the communicator group associated with the file handle fh must call MPIO_Iwrite_all.
Each process may pass different argument values for the offset, buftype and bufcount
arguments. For each process in the group, MPIO_Iwrite_all attempts to write, into the file
associated with fh (at the offset position), a total number of bufcount data items having buftype
type. MPIO_Iwrite_all also associates an individual request handle request to the I/O request
for each process. The request handle can be used later by a process to query the status
of its individual write request or wait for its completion. On each process, MPIO_Iwrite_all
completes when the individual write request has completed (i.e. a process does not have
to wait for all other processes to complete). The user should not access any part of the
supplied buffer after a nonblocking write is called, until the write completes.

7 File pointers 

7.1 Introduction 

When a file is opened in MPI-IO, the system creates a set of file pointers to keep track of 
the current file position. One is a global file pointer which is shared by all the processes in 
the communicator group. The others are individual file pointers local to each process in the 
communicator group, and can be updated independently. 

All the I/O functions described above in Sections 5 and 6 require an explicit offset to be 
passed as an argument. Those functions do not use the system-maintained file pointers, nor 
do those functions update the system maintained file pointers. In this section we describe 
an alternative set of functions that use the system-maintained file pointers. There
are two such sets: one using the individual pointers, and the other using the shared pointer. The
main difference from the previous functions is that an offset argument is not required. In
order to allow the offset to be set, seek functions are provided.

The main semantics issue with system-maintained file pointers is how they are updated
by I/O operations. In general, each I/O operation leaves the pointer pointing to the next
data item after the last one that was accessed. This principle applies to both types of
offsets (MPIO_OFFSET_ABSOLUTE and MPIO_OFFSET_RELATIVE), to both types of pointers
(individual and shared), and to all types of I/O operations (read and write, blocking and
nonblocking). The details, however, may be slightly different.

When absolute offsets are used, the pointer is left pointing to the next etype after the 
last one that was accessed. This etype may be accessible to the process, or it may not be 
accessible (see the discussion in Section 2). If it is not, then the next I/O operation will 
automatically advance the pointer to the next accessible etype. With relative offsets, only 
accessible etypes are counted. Therefore it is possible to formalize the update procedure as 
follows: 


\[ \mathit{new\_file\_position} = \mathit{old\_position} + \frac{\mathrm{size}(\mathit{buftype}) \times \mathit{bufcount}}{\mathrm{size}(\mathit{etype})} \]


In all cases (blocking or nonblocking operation, individual or shared file pointer, ab- 
solute or relative offset), the file pointer is updated when the operation is initiated (see 
Appendix B.2 for the reasons behind this design choice), in other words before the access 
is performed. 


Advice to users. This update reflects the amount of data that is requested by the 
access, not the amount that will be actually accessed. Typically, these two values 
will be the same, but they can differ in certain cases (e.g. a read request that reaches 
EOF). This differs from the usual Unix semantics, and the user is encouraged to check 
for EOF occurrence in order to account for the fact that the file pointer may point 
beyond the end of file. In rare cases (e.g. a nonblocking read reaching EOF followed 
by a write), this can cause problems (e.g. creation of holes in the file). (End of advice 
to users.) 
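
The update rule and the end-of-file caveat can be illustrated with a small C sketch. The function names advance_pointer and etypes_past_eof are hypothetical, and the sketch assumes that the size of buftype is a whole multiple of the size of etype, which follows from buftype being constructed from etype.

```c
#include <assert.h>

/* New file position, in etype units, after an access is initiated.
 * The pointer moves by the amount requested, not the amount that is
 * eventually transferred. Sizes are in bytes. */
static long long advance_pointer(long long old_position,
                                 long long buftype_size,
                                 long long bufcount,
                                 long long etype_size)
{
    assert(etype_size > 0 && bufcount >= 0);
    assert((buftype_size * bufcount) % etype_size == 0);
    return old_position + (buftype_size * bufcount) / etype_size;
}

/* How far (in etypes) a pointer sits beyond the end of a file that
 * holds file_etypes items; 0 if it is still inside the file. */
static long long etypes_past_eof(long long position, long long file_etypes)
{
    return position > file_etypes ? position - file_etypes : 0;
}
```

For example, starting at position 10 and requesting 3 items of a 32-byte buftype with an 8-byte etype moves the pointer to 22. If the file holds only 15 etypes, the pointer is then 7 etypes beyond the end of file, which is the situation the advice above warns about.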


7.2 Shared File Pointer I/O Functions 

These functions use and update the global current file position maintained by the system.
The individual file pointers are neither used nor updated. Note that only independent functions
are currently defined. It is debatable whether or not collective functions are required as
well. This issue is addressed in Appendix B.3.

Advice to users. A shared file pointer only makes sense if all the processes can access
the same dataset. This means that all the processes should use the same filetype when
opening the file. (End of advice to users.)


7.2.1 MPIO_Read_shared (independent)

MPIO_READ_SHARED(fh, buff, buftype, bufcount, status)

  IN   fh        Valid file handle (handle)
  OUT  buff      Initial address of the user’s buffer (integer)
  IN   buftype   User’s buffer datatype (handle)
  IN   bufcount  Number of buftype elements (nonnegative integer)
  OUT  status    Status information (Status)

int MPIO_Read_shared(MPIO_File fh, void *buff, MPI_Datatype buftype, int
                     bufcount, MPI_Status *status)

MPIO_READ_SHARED(FH, BUFF, BUFTYPE, BUFCOUNT, STATUS, IERROR)
    <type> BUFF(*)
    INTEGER FH, BUFTYPE, BUFCOUNT, STATUS(MPI_STATUS_SIZE), IERROR

MPIO_Read_shared has the same semantics as MPIO_Read with offset set to the global
current position maintained by the system.

If multiple processes within the communicator group issue MPIO_Read_shared calls, the 
data returned by the MPIO_Read_shared calls will be as if the calls were serialized; that is 
the processes will not have read the same data. The ordering is not deterministic. The user 
needs to use other synchronization means to enforce a specific order. 

After the read operation is initiated, the shared file pointer is updated to point to the 
next data item after the last one requested. 
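
The "as if serialized" behavior can be pictured as each call claiming a region of the file and advancing the shared pointer before any data moves. The C sketch below models only that bookkeeping in a single process; shared_fp_sketch and shared_fp_claim are hypothetical names, and a real implementation would need an atomic update or some other form of coordination among processes, which this sketch does not attempt.

```c
#include <assert.h>

/* Single-process model of the shared file pointer (in etype units). */
typedef struct {
    long long position;
} shared_fp_sketch;

/* Claims `count` etypes starting at the current shared position and
 * advances the pointer by the requested amount. Returns the start of
 * the claimed region. Because each claim moves the pointer before the
 * next claim is made, no two claims overlap. */
static long long shared_fp_claim(shared_fp_sketch *fp, long long count)
{
    long long start;
    assert(count >= 0);
    start = fp->position;
    fp->position += count;
    return start;
}

/* Demo pointer, starting at the beginning of the accessible data. */
static shared_fp_sketch demo_fp = { 0 };
```

If one caller claims 4 items and another then claims 6, they receive the disjoint regions [0, 4) and [4, 10). Which caller comes first is not determined, matching the nondeterministic ordering described above.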


7.2.2 MPIO_Write_shared (independent) 


MPIO_WRITE_SHARED(fh, buff, buftype, bufcount, status)

  IN   fh        Valid file handle (handle)
  IN   buff      Initial address of the user’s buffer (integer)
  IN   buftype   User’s buffer datatype (handle)
  IN   bufcount  Number of buftype elements (nonnegative integer)
  OUT  status    Status information (Status)

int MPIO_Write_shared(MPIO_File fh, void *buff, MPI_Datatype buftype, int
                      bufcount, MPI_Status *status)

MPIO_WRITE_SHARED(FH, BUFF, BUFTYPE, BUFCOUNT, STATUS, IERROR)
    <type> BUFF(*)
    INTEGER FH, BUFTYPE, BUFCOUNT, STATUS(MPI_STATUS_SIZE), IERROR

MPIO_Write_shared has the same semantics as MPIO_Write with offset set to the global 
current position maintained by the system. 

If multiple processes within the communicator group issue MPIO_Write_shared calls, the
data will be written as if the MPIO_Write_shared calls were serialized; that is the processes
will not overwrite each other’s data. The ordering is not deterministic. The user needs to
use other synchronization means to enforce a specific order.

After the write operation is initiated, the current global file pointer is updated to point 
to the next data item after the last one requested. 

7.2.3 MPIO_Iread_shared (independent)

MPIO_IREAD_SHARED(fh, buff, buftype, bufcount, request)

  IN   fh        Valid file handle (handle)
  OUT  buff      Initial address of the user’s buffer (integer)
  IN   buftype   User’s buffer datatype (handle)
  IN   bufcount  Number of buftype elements (nonnegative integer)
  OUT  request   Read request handle (handle)

int MPIO_Iread_shared(MPIO_File fh, void *buff, MPI_Datatype buftype,
                      int bufcount, MPI_Request *request)

MPIO_IREAD_SHARED(FH, BUFF, BUFTYPE, BUFCOUNT, REQUEST, IERROR)
    <type> BUFF(*)
    INTEGER FH, BUFTYPE, BUFCOUNT, REQUEST, IERROR

MPIO_Iread_shared is a nonblocking version of the MPIO_Read_shared interface.
MPIO_Iread_shared associates a request handle request with the I/O request. The request
handle can be used later to query the status of the read request, using the MPI function
MPI_Test, or wait for its completion, using the function MPI_Wait.

If multiple processes within the communicator group issue MPIO_Iread_shared calls, the
data returned by the MPIO_Iread_shared calls will be as if the calls were serialized; that is
the processes will not have read the same data. The ordering is not deterministic. The user
needs to use other synchronization means to enforce a specific order.

After the read operation is successfully initiated, the shared file pointer is updated to
point to the next data item after the last one requested.

7.2.4 MPIO_Iwrite_shared (independent)


MPIO_IWRITE_SHARED(fh, buff, buftype, bufcount, request)

  IN   fh        Valid file handle (handle)
  IN   buff      Initial address of the user’s buffer (integer)
  IN   buftype   User’s buffer datatype (handle)
  IN   bufcount  Number of buftype elements (nonnegative integer)
  OUT  request   Write request handle (handle)

int MPIO_Iwrite_shared(MPIO_File fh, void *buff, MPI_Datatype buftype,
                       int bufcount, MPI_Request *request)

MPIO_IWRITE_SHARED(FH, BUFF, BUFTYPE, BUFCOUNT, REQUEST, IERROR)
    <type> BUFF(*)
    INTEGER FH, BUFTYPE, BUFCOUNT, REQUEST, IERROR

MPIO_Iwrite_shared is a nonblocking version of the MPIO_Write_shared interface.
MPIO_Iwrite_shared associates a request handle request with the I/O request. The request
handle can be used later to query the status of the write request, using the MPI function
MPI_Test, or wait for its completion, using MPI_Wait.

If multiple processes within the communicator group issue MPIO_Iwrite_shared calls, the
data will be written as if the MPIO_Iwrite_shared calls were serialized; that is the processes
will not overwrite each other’s data. The ordering is not deterministic. The user needs to
use other synchronization means to enforce a specific order.

After the write operation is successfully initiated, the current global file pointer is 
updated to point to the next data item after the last one requested. 

7.3 Individual File Pointer Blocking I/O Functions 

These functions use and update only the individual current file position maintained by the
system. They neither use nor update the shared global file pointer.

In general, these functions have the same semantics as the blocking functions described 
in Sections 5 and 6, with the offset argument set to the current value of the system- 
maintained individual file pointer. This file pointer is updated at the time the I/O is 
initiated and points to the next data item after the last one requested. For collective I/O, 
each individual file pointer is updated independently. 

7.3.1 MPIO_Read_next (independent)

MPIO_READ_NEXT(fh, buff, buftype, bufcount, status)

  IN   fh        Valid file handle (handle)
  OUT  buff      Initial address of the user’s buffer (integer)
  IN   buftype   User’s buffer datatype (handle)
  IN   bufcount  Number of buftype elements (nonnegative integer)
  OUT  status    Status information (Status)

int MPIO_Read_next(MPIO_File fh, void *buff, MPI_Datatype buftype,
                   int bufcount, MPI_Status *status)

MPIO_READ_NEXT(FH, BUFF, BUFTYPE, BUFCOUNT, STATUS, IERROR)
    <type> BUFF(*)
    INTEGER FH, BUFTYPE, BUFCOUNT, STATUS(MPI_STATUS_SIZE), IERROR

MPIO_Read_next attempts to read from the file associated with fh (at the system maintained
current file position) a total number of bufcount data items having buftype datatype
into the user’s buffer buff. The data is taken out of those parts of the file specified by
filetype. MPIO_Read_next returns the number of buftype elements read in status. The file
pointer is updated by the amount of data requested.

7.3.2 MPIO_Write_next (independent)

MPIO_WRITE_NEXT(fh, buff, buftype, bufcount, status)

  IN   fh        Valid file handle (handle)
  IN   buff      Initial address of the user’s buffer (integer)
  IN   buftype   User’s buffer datatype (handle)
  IN   bufcount  Number of buftype elements (nonnegative integer)
  OUT  status    Status information (Status)

int MPIO_Write_next(MPIO_File fh, void *buff, MPI_Datatype buftype,
                    int bufcount, MPI_Status *status)

MPIO_WRITE_NEXT(FH, BUFF, BUFTYPE, BUFCOUNT, STATUS, IERROR)
    <type> BUFF(*)
    INTEGER FH, BUFTYPE, BUFCOUNT, STATUS(MPI_STATUS_SIZE), IERROR


MPIO_Write_next attempts to write into the file associated with fh (at the system maintained
current file position) a total number of bufcount data items having buftype datatype
from the user’s buffer buff. The data is written into those parts of the file specified by
filetype. MPIO_Write_next returns the number of buftype elements written in status. The
file pointer is updated by the amount of data requested.

7.3.3 MPIO_Read_next_all (collective)


MPIO_READ_NEXT_ALL(fh, buff, buftype, bufcount, status) 


IN    fh          [SAME] Valid file handle (handle)
OUT   buff        Initial address of the user’s buffer (integer)
IN    buftype     User’s buffer datatype (handle)
IN    bufcount    Number of buftype elements (nonnegative integer)
OUT   status      Status information (Status)

int MPIO_Read_next_all(MPIO_File fh, void *buff, MPI_Datatype buftype,
        int bufcount, MPI_Status *status)

MPIO_READ_NEXT_ALL(FH, BUFF, BUFTYPE, BUFCOUNT, STATUS, IERROR)
    <type> BUFF(*)
    INTEGER FH, BUFTYPE, BUFCOUNT, STATUS(MPI_STATUS_SIZE), IERROR

MPIO_Read_next_all is a collective version of the MPIO_Read_next interface. All processes
in the communicator group associated with the file handle fh must call MPIO_Read_next_all.
Each process may pass different argument values for the buftype and bufcount arguments.

For each process, MPIO_Read_next_all attempts to read, from the file associated with fh (at
the system maintained current file position), a total number of bufcount data items having
buftype type into the user’s buffer buff. MPIO_Read_next_all returns the number of buftype
elements read in status. The file pointer of each process is updated by the amount of data
requested by that process.

7.3.4 MPIO_Write_next_all (collective) 


MPIO_WRITE_NEXT_ALL(fh, buff, buftype, bufcount, status)

IN    fh          [SAME] Valid file handle (handle)
IN    buff        Initial address of the user’s buffer (integer)
IN    buftype     User’s buffer datatype (handle)
IN    bufcount    Number of buftype elements (nonnegative integer)
OUT   status      Status information (Status)

int MPIO_Write_next_all(MPIO_File fh, void *buff, MPI_Datatype buftype,
        int bufcount, MPI_Status *status)

MPIO_WRITE_NEXT_ALL(FH, BUFF, BUFTYPE, BUFCOUNT, STATUS, IERROR)
    <type> BUFF(*)
    INTEGER FH, BUFTYPE, BUFCOUNT, STATUS(MPI_STATUS_SIZE), IERROR

MPIO_Write_next_all is a collective version of the blocking MPIO_Write_next interface.
All processes in the communicator group associated with the file handle fh must call
MPIO_Write_next_all. Each process may pass different argument values for the buftype and
bufcount arguments. For each process, MPIO_Write_next_all attempts to write, into the file
associated with fh (at the system maintained current file position), a total number of
bufcount data items having buftype type. MPIO_Write_next_all returns the number of buftype
elements written in status. The file pointer of each process is updated by the amount of
data requested by that process.

7.4 Individual File Pointer Nonblocking I/O Functions 

Like the functions described in Section 7.3, these functions only use and update the individ- 
ual current file position maintained by the system. They do not use nor update the shared 
global file pointer. 

In general, these functions have the same semantics as the nonblocking functions de- 
scribed in Sections 5 and 6, with the offset argument set to the current value of the system- 
maintained individual file pointer. This file pointer is updated when the I/O is initiated 
and reflects the amount of data requested. For collective I/O, each individual file pointer 
is updated independently. 
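
The update rule above can be illustrated with a small stand-alone model in plain C; it makes no MPI-IO calls. The type model_fp and the function model_start_request are hypothetical names introduced only for this sketch, and it assumes that offsets are counted in etype units and that one buftype element occupies exactly one etype.

```c
#include <assert.h>

/* Hypothetical model of one process's individual file pointer,
 * counted in etype units. Not part of the MPI-IO interface. */
typedef struct {
    long position;
} model_fp;

/*
 * Model the initiation of a nonblocking access of `count` elements.
 * Returns the offset at which the access will take place, and advances
 * the pointer immediately by the amount requested, without waiting for
 * the access to complete.
 */
long model_start_request(model_fp *fp, long count)
{
    assert(fp != 0 && count >= 0);
    long start = fp->position;
    fp->position += count;
    return start;
}
```

In this model, two requests of 100 and 50 elements issued back to back from position 0 are assigned offsets 0 and 100, and the pointer ends at 150 even if neither access has completed.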

7.4.1 MPIO_Iread_next (independent)

MPIO_IREAD_NEXT(fh, buff, buftype, bufcount, request)

IN    fh          Valid file handle (handle)
OUT   buff        Initial address of the user’s buffer (integer)
IN    buftype     User’s buffer datatype (handle)
IN    bufcount    Number of buftype elements (nonnegative integer)
OUT   request     Read request handle (handle)

int MPIO_Iread_next(MPIO_File fh, void *buff, MPI_Datatype buftype,
        int bufcount, MPI_Request *request)

MPIO_IREAD_NEXT(FH, BUFF, BUFTYPE, BUFCOUNT, REQUEST, IERROR)
    <type> BUFF(*)
    INTEGER FH, BUFTYPE, BUFCOUNT, REQUEST, IERROR

MPIO_Iread_next is a nonblocking version of the MPIO_Read_next interface.
MPIO_Iread_next associates a request handle request with the I/O request. The request
handle can be used later to query the status of the read request, using the MPI function
MPI_Test, or wait for its completion, using the function MPI_Wait. The pointer is updated
by the amount of data requested.

7.4.2 MPIO_Iwrite_next (independent)

MPIO_IWRITE_NEXT(fh, buff, buftype, bufcount, request)

IN    fh          Valid file handle (handle)
IN    buff        Initial address of the user’s buffer (integer)
IN    buftype     User’s buffer datatype (handle)
IN    bufcount    Number of buftype elements (nonnegative integer)
OUT   request     Write request handle (handle)

int MPIO_Iwrite_next(MPIO_File fh, void *buff, MPI_Datatype buftype,
        int bufcount, MPI_Request *request)

MPIO_IWRITE_NEXT(FH, BUFF, BUFTYPE, BUFCOUNT, REQUEST, IERROR)
    <type> BUFF(*)
    INTEGER FH, BUFTYPE, BUFCOUNT, REQUEST, IERROR

MPIO_Iwrite_next is a nonblocking version of the MPIO_Write_next interface.
MPIO_Iwrite_next associates a request handle request with the I/O request. The request
handle can be used later to query the status of the write request, using the MPI function
MPI_Test, or wait for its completion, using MPI_Wait. The pointer is updated by the amount
of data requested.

7.4.3 MPIO_Iread_next_all (collective)

MPIO_IREAD_NEXT_ALL(fh, buff, buftype, bufcount, request)

IN    fh          [SAME] Valid file handle (handle)
OUT   buff        Initial address of the user’s buffer (integer)
IN    buftype     User’s buffer datatype (handle)
IN    bufcount    Number of buftype elements (nonnegative integer)
OUT   request     Read request handle (handle)

int MPIO_Iread_next_all(MPIO_File fh, void *buff, MPI_Datatype buftype,
        int bufcount, MPI_Request *request)

MPIO_IREAD_NEXT_ALL(FH, BUFF, BUFTYPE, BUFCOUNT, REQUEST, IERROR)
    <type> BUFF(*)
    INTEGER FH, BUFTYPE, BUFCOUNT, REQUEST, IERROR

MPIO_Iread_next_all is a collective version of the nonblocking MPIO_Iread_next interface.
All processes in the communicator group associated with the file handle fh must call
MPIO_Iread_next_all. Each process may pass different argument values for the buftype and
bufcount arguments. For each process in the group, MPIO_Iread_next_all attempts to read,
from the file associated with fh (at the system maintained current file position), a total
number of bufcount data items having buftype type into the user’s buffer buff.
MPIO_Iread_next_all associates an individual request handle request to the I/O request for
each process. The request handle can be used later by a process to query the status of its
individual read request or wait for its completion. On each process, MPIO_Iread_next_all
completes when the individual request has completed (i.e. a process does not have to wait
for all other processes to complete). The user should not access any part of the receiving
buffer after a nonblocking read is called, until the read completes. The pointer is updated
by the amount of data requested.

7.4.4 MPIO_Iwrite_next_all (collective)

MPIO_IWRITE_NEXT_ALL(fh, buff, buftype, bufcount, request)

IN    fh          [SAME] Valid file handle (handle)
IN    buff        Initial address of the user’s buffer (integer)
IN    buftype     User’s buffer datatype (handle)
IN    bufcount    Number of buftype elements (nonnegative integer)
OUT   request     Write request handle (handle)

int MPIO_Iwrite_next_all(MPIO_File fh, void *buff, MPI_Datatype buftype,
        int bufcount, MPI_Request *request)

MPIO_IWRITE_NEXT_ALL(FH, BUFF, BUFTYPE, BUFCOUNT, REQUEST, IERROR)
    <type> BUFF(*)
    INTEGER FH, BUFTYPE, BUFCOUNT, REQUEST, IERROR

MPIO_Iwrite_next_all is a collective version of the nonblocking MPIO_Iwrite_next interface.
All processes in the communicator group associated with the file handle fh must call
MPIO_Iwrite_next_all. Each process may pass different argument values for the buftype and
bufcount arguments. For each process in the group, MPIO_Iwrite_next_all attempts to write,
into the file associated with fh (at the system maintained file position), a total number of
bufcount data items having buftype type. MPIO_Iwrite_next_all also associates an individual
request handle request to the I/O request for each process. The request handle can be used
later by a process to query the status of its individual write request or wait for its
completion. On each process, MPIO_Iwrite_next_all completes when the individual write request
has completed (i.e. a process does not have to wait for all other processes to complete). The
user should not access any part of the supplied buffer after a nonblocking write is called,
until the write is completed. The pointer is updated by the amount of data requested.

7.5 File Pointer Manipulation Functions

7.5.1 MPIO_Seek (independent) 


MPIO_SEEK(fh, offset, whence) 


IN    fh          Valid file handle (handle)
IN    offset      File offset (offset)
IN    whence      Update mode (integer)

int MPIO_Seek(MPIO_File fh, MPIO_Offset offset, MPIO_Whence whence)

MPIO_SEEK(FH, OFFSET, WHENCE)
    INTEGER FH, WHENCE
    INTEGER*8 OFFSET

MPIO_Seek updates the individual file pointer according to whence, which could have
the following possible values:

• MPIO_SEEK_SET: the pointer is set to offset

• MPIO_SEEK_CUR: the pointer is set to the current file position plus offset

• MPIO_SEEK_END: the pointer is set to the end of the file plus offset

The interpretation of offset depends on the value of moffset given when the file was
opened. If it was MPIO_OFFSET_ABSOLUTE, then offset is relative to the displacement,
regardless of what the filetype is. If it is MPIO_OFFSET_RELATIVE, then offset is relative
to the filetype (not counting holes). In either case, it is in units of etype.
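
The three update modes can be summarized by a stand-alone model in plain C that makes no MPI-IO calls. The names model_whence and model_seek are hypothetical, the numeric values of the enumerators are arbitrary, and all quantities are assumed to be in etype units. The sketch uses a signed offset so that positions before the current pointer or before end of file can be expressed; the text above does not say whether negative offsets are permitted.

```c
#include <assert.h>

/* Hypothetical stand-ins for MPIO_SEEK_SET, MPIO_SEEK_CUR and MPIO_SEEK_END.
 * The numeric values are arbitrary and not taken from the MPI-IO draft. */
typedef enum { MODEL_SEEK_SET, MODEL_SEEK_CUR, MODEL_SEEK_END } model_whence;

/*
 * Compute the new file pointer, in etype units.
 *   current : pointer value before the call
 *   eof     : position of the end of file, in the same units
 *   offset  : the offset argument (signed in this sketch)
 */
long model_seek(long current, long eof, long offset, model_whence whence)
{
    switch (whence) {
    case MODEL_SEEK_SET: return offset;           /* absolute position      */
    case MODEL_SEEK_CUR: return current + offset; /* relative to pointer    */
    case MODEL_SEEK_END: return eof + offset;     /* relative to end of file */
    }
    assert(0 && "unknown whence value");
    return current;
}
```

For a pointer at 40 in a file whose end is at 100, an offset of 10 yields 10, 50, and 110 under the three modes respectively.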

7.5.2 MPIO_Seek_shared (collective) 


MPIO_SEEK_SHARED(fh, offset, whence) 


IN    fh          [SAME] Valid file handle (handle)
IN    offset      [SAME] File offset (offset)
IN    whence      [SAME] Update mode (integer)

int MPIO_Seek_shared(MPIO_File fh, MPIO_Offset offset, MPIO_Whence whence)

MPIO_SEEK_SHARED(FH, OFFSET, WHENCE)
    INTEGER FH, WHENCE
    INTEGER*8 OFFSET

MPIO_Seek_shared updates the global shared file pointer according to whence, which
could have the following possible values:

• MPIO_SEEK_SET: the pointer is set to offset

• MPIO_SEEK_CUR: the pointer is set to the current file position plus offset

• MPIO_SEEK_END: the pointer is set to the end of the file plus offset

All the processes in the communicator group associated with the file handle fh must
call MPIO_Seek_shared with the same offset and whence. All processes in the communicator
group are synchronized with a barrier before the global file pointer is updated.

The interpretation of offset depends on the value of moffset given when the file was
opened. If it was MPIO_OFFSET_ABSOLUTE, then offset is relative to the displacement,
regardless of what the filetype is. If it is MPIO_OFFSET_RELATIVE, then offset is relative
to the filetype (not counting holes). In either case, it is in units of etype.

8 Filetype Constructors 

8.1 Introduction 

Common I/O operations (e.g., broadcast read, rank-ordered blocks, etc.) are easily
expressed in MPI-IO using the previously defined read/write operations and carefully defined
filetypes. In order to simplify generation of common filetypes, MPI-IO provides the
following MPI datatype constructors.

Although it is possible to implement these type constructors as local operations, in order 
to facilitate efficient implementations of file I/O operations, all of the filetype constructors 
have been defined to be collective operations. (Recall that a collective operation does not 
imply a barrier synchronization.) 

The set of datatypes created by a single (collective) filetype constructor should be used 
together in collective I/O operations, with identical offsets, and such that the same number 
of etype elements is read/written by each process. 

Advice to users. The user is not required to adhere to this expected usage; however, 
the outcome of such operations, although well-defined, will likely be very confusing. 
(End of advice to users.) 

Each newly created datatype newtype consists of zero or more copies of the base type
oldtype, possibly separated by holes. The extent of the new datatype is a nonnegative
integer multiple of the extent of the base type. All datatype constructors return a success
or failure code.

8.2 Broadcast-Read and Write-Reduce Constructors

8.2.1 MPIO_Type_read_bcast

MPIO_TYPE_READ_BCAST(comm, oldtype, newtype)

IN    comm        [SAME] communicator to be used in MPIO_Open (handle)
IN    oldtype     [SAME] old datatype (handle)
OUT   newtype     new datatype (handle)

int MPIO_Type_read_bcast(MPI_Comm comm, MPI_Datatype oldtype,
        MPI_Datatype *newtype)

MPIO_TYPE_READ_BCAST(COMM, OLDTYPE, NEWTYPE, IERROR)
    INTEGER COMM, OLDTYPE, NEWTYPE, IERROR

MPIO_Type_read_bcast generates a set of new filetypes (one for each member of the
group) which, when passed to a collective read operation (with identical offsets), will
broadcast the same data to all readers. Although semantically equivalent to
MPI_Type_contiguous(1, oldtype, newtype), a good implementation may be able to optimize the
broadcast read operation by using the types generated by this call.

8.2.2 MPIO_Type_write_reduce 


MPIO_TYPE_WRITE_REDUCE(comm, oldtype, newtype) 


IN    comm        [SAME] communicator to be used in MPIO_Open (handle)
IN    oldtype     [SAME] old datatype (handle)
OUT   newtype     new datatype (handle)

int MPIO_Type_write_reduce(MPI_Comm comm, MPI_Datatype oldtype,
        MPI_Datatype *newtype)

MPIO_TYPE_WRITE_REDUCE(COMM, OLDTYPE, NEWTYPE, IERROR)
    INTEGER COMM, OLDTYPE, NEWTYPE, IERROR

MPIO_Type_write_reduce generates a set of new filetypes (one for each member of the
group) which, when passed to a collective write operation, will result in the data from
exactly one of the callers being written to the file. A write reduce operation is semantically
equivalent to passing the type generated by MPI_Type_contiguous(1, oldtype, newtype), to a
collective write operation (with identical offsets), with MPIO_CAUTIOUS mode enabled. A
good implementation may be able to optimize the write reduce operation by using the types
generated by this call.

Advice to implementors. The choice of which process actually performs the write
operation can either be always the same process (e.g., the process with rank 0 in the
process group) or arbitrary (e.g., the first process issuing the call), since no checking
of data identity is to be performed. (End of advice to implementors.)

8.3 Scatter/Gather Type Constructors
8.3.1 MPIO_Type_scatter_gather

MPIO_TYPE_SCATTER_GATHER(comm, oldtype, newtype)

IN    comm        [SAME] communicator to be used in MPIO_Open (handle)
IN    oldtype     [SAME] old datatype (handle)
OUT   newtype     new datatype (handle)

int MPIO_Type_scatter_gather(MPI_Comm comm, MPI_Datatype oldtype,
        MPI_Datatype *newtype)

MPIO_TYPE_SCATTER_GATHER(COMM, OLDTYPE, NEWTYPE, IERROR)
    INTEGER COMM, OLDTYPE, NEWTYPE, IERROR

This type allows each process in the group to access a distinct block of the file in rank
order. The blocks are identical in size and datatype; each is of type oldtype.

To achieve the scatter or gather operation, the types returned should be passed to a
collective read or write operation, giving identical offsets. Generated newtypes will not be
identical, but will have the same extent.
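
One layout consistent with this description can be sketched in plain C, without any MPI-IO calls. This is an assumption, not something the draft states: the sketch supposes that the filetype of each rank spans one block per process, with the block of rank r placed after the blocks of all lower ranks. The names model_layout and model_scatter_gather are hypothetical.

```c
#include <assert.h>

/* Hypothetical description of where one rank's block sits in its filetype. */
typedef struct {
    long displacement; /* start of the rank's block within the filetype   */
    long extent;       /* total extent of the filetype, same on all ranks */
} model_layout;

/*
 * Assumed layout: nprocs blocks of oldtype_extent each, laid out in rank
 * order. Each rank sees only its own block; the rest of the extent is holes.
 */
model_layout model_scatter_gather(int rank, int nprocs, long oldtype_extent)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs && oldtype_extent >= 0);
    model_layout layout;
    layout.displacement = (long)rank * oldtype_extent;
    layout.extent = (long)nprocs * oldtype_extent;
    return layout;
}
```

Under this assumption, with 4 processes and an oldtype extent of 8, rank 2 accesses the block starting at 16 within a filetype of extent 32; the displacements differ per rank while the extent does not.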

8.3.2 MPIO_Type_scatterv_gatherv

MPIO_TYPE_SCATTERV_GATHERV(comm, count, oldtype, newtype)

IN    comm        [SAME] communicator to be used in MPIO_Open (handle)
IN    count       number of elements of oldtype in this block (nonnegative integer)
IN    oldtype     old datatype (handle)
OUT   newtype     new datatype (handle)

int MPIO_Type_scatterv_gatherv(MPI_Comm comm, int count,
        MPI_Datatype oldtype, MPI_Datatype *newtype)

MPIO_TYPE_SCATTERV_GATHERV(COMM, COUNT, OLDTYPE, NEWTYPE, IERROR)
    INTEGER COMM, COUNT, OLDTYPE, NEWTYPE, IERROR

This type allows each process in the group to access a distinct block of the file in rank
order. The block sizes and types may be different; each block is defined as count repeated
copies of the passed datatype oldtype (i.e. MPI_Type_contiguous(count, oldtype, oldtype)).

To achieve the scatter or gather operation, the types returned should be passed to a
collective read or write operation, giving identical offsets.
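
Because the blocks may differ in size, the position of each block depends on the sizes of all lower-ranked blocks. The following plain C sketch, which makes no MPI-IO calls, assumes the blocks are packed back to back in rank order; the draft does not spell out the layout, and the function name model_scatterv_displacement is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * block_extent[i] is the extent of rank i's block, that is, count times
 * the extent of that rank's oldtype. Returns the start of `rank`'s block
 * relative to the start of the filetype, assuming the blocks are packed
 * in rank order: the sum of the extents of all lower-ranked blocks.
 */
long model_scatterv_displacement(const long *block_extent, size_t nprocs,
                                 size_t rank)
{
    assert(block_extent != NULL && rank < nprocs);
    long displacement = 0;
    for (size_t i = 0; i < rank; i++) {
        assert(block_extent[i] >= 0);
        displacement += block_extent[i];
    }
    return displacement;
}
```

A rank that passes a count of zero contributes an empty block, so the next rank starts at the same displacement.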




8.4 HPF Filetype Constructors 

The HPF [5] filetype constructors create, for each process in a group, a (possibly different) 
filetype. When used in a collective I/O operation (with identical offsets), this set of filetypes 
defines the particular HPF distribution. 

Each dimension of an array can be distributed in one of three ways: 

• MPIO_HPF_BLOCK - Block distribution

• MPIO_HPF_CYCLIC - Cyclic distribution

• MPIO_HPF_NONE - Dimension not distributed

In order to specify a default distribution argument, the constant MPIO_HPF_DFLT_ARG
is used.

For example, ARRAY(CYCLIC(15)) corresponds to MPIO_HPF_CYCLIC with a distribution
argument of 15, and ARRAY(BLOCK) corresponds to MPIO_HPF_BLOCK with a distribution
argument of MPIO_HPF_DFLT_ARG.
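
To make the two distributions concrete, the following plain C sketch computes which process owns a given array element along one dimension. It makes no MPI-IO calls. The names and the value of MODEL_DFLT_ARG are hypothetical, indices are 0-based, and the defaults follow the usual HPF rules as assumed here: a default BLOCK size of ceiling(n / nprocs) and a default CYCLIC size of 1.

```c
#include <assert.h>

/* Hypothetical stand-in for MPIO_HPF_DFLT_ARG; the value is arbitrary. */
#define MODEL_DFLT_ARG (-1L)

/* Owner of 0-based index i in a dimension of n elements distributed
 * BLOCK(darg) over nprocs processes: consecutive runs of b elements. */
int model_block_owner(long i, long n, int nprocs, long darg)
{
    assert(nprocs > 0 && i >= 0 && i < n);
    long b = (darg == MODEL_DFLT_ARG) ? (n + nprocs - 1) / nprocs : darg;
    assert(b > 0 && b * nprocs >= n); /* blocks must cover the dimension */
    return (int)(i / b);
}

/* Owner of 0-based index i under CYCLIC(darg): runs of k elements are
 * dealt out to the processes round-robin. */
int model_cyclic_owner(long i, long n, int nprocs, long darg)
{
    assert(nprocs > 0 && i >= 0 && i < n);
    long k = (darg == MODEL_DFLT_ARG) ? 1 : darg;
    assert(k > 0);
    return (int)((i / k) % nprocs);
}
```

With 100 elements on 4 processes, the default BLOCK gives each process 25 consecutive elements, while CYCLIC(15) gives process 0 elements 0 to 14, process 1 elements 15 to 29, and wraps back to process 0 at element 60.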

8.4.1 MPIO_Type_hpf

HPF distribution of an N-dimensional array:

MPIO_TYPE_HPF(comm, ndim, dsize, distrib, darg, oldtype, newtype)

IN    comm        [SAME] communicator to be used in MPIO_Open (handle)
IN    ndim        [SAME] number of array dimensions (nonnegative integer)
IN    dsize       [SAME] size of dimension of distributee (array of nonnegative
                  offset)
IN    distrib     [SAME] HPF distribution of dimension (array of integer)
IN    darg        [SAME] distribution argument of dimension, e.g. BLOCK(darg),
                  CYCLIC(darg), or MPIO_HPF_NONE (array of integer)
IN    oldtype     [SAME] old datatype (handle)
OUT   newtype     new datatype (handle)

int MPIO_Type_hpf(MPI_Comm comm, int ndim, MPIO_Offset *dsize,
        MPIO_Dtype *distrib, int *darg, MPI_Datatype oldtype,
        MPI_Datatype *newtype)

MPIO_TYPE_HPF(COMM, NDIM, DSIZE, DISTRIB, DARG, OLDTYPE, NEWTYPE, IERROR)
    INTEGER COMM, NDIM, DSIZE(*), DISTRIB(*), DARG(*), OLDTYPE, NEWTYPE,
            IERROR

MPIO_Type_hpf generates a filetype corresponding to the HPF distribution of an
ndim-dimensional array of oldtype specified by the arguments.

For example, in order to generate the types corresponding to the HPF distribution: 
    <oldtype> FILEARRAY(100, 200, 300)
    MPI_COMM_SIZE(comm, size, ierror)
    !HPF$ PROCESSORS PROCESSES(size)
    !HPF$ DISTRIBUTE FILEARRAY(CYCLIC(10), *, BLOCK) ONTO PROCESSES

The corresponding MPI-IO type would be created by the following code:

    ndim = 3;
    dsize[0] = 100; distrib[0] = MPIO_HPF_CYCLIC; darg[0] = 10;
    dsize[1] = 200; distrib[1] = MPIO_HPF_NONE;   darg[1] = 0;
    dsize[2] = 300; distrib[2] = MPIO_HPF_BLOCK;  darg[2] = MPIO_HPF_DFLT_ARG;
    MPIO_Type_hpf(comm, ndim, dsize, distrib, darg, oldtype, &newtype);

8.4.2 MPIO_Type_hpf_block 

HPF BLOCK distribution of a one-dimensional array: 

MPIO_TYPE_HPF_BLOCK(comm, dsize, darg, oldtype, newtype) 


IN    comm        [SAME] communicator to be used in MPIO_Open (handle)
IN    dsize       [SAME] size of distributee (nonnegative offset)
IN    darg        [SAME] distribution argument, e.g. BLOCK(darg) (integer)
IN    oldtype     [SAME] old datatype (handle)
OUT   newtype     new datatype (handle)

int MPIO_Type_hpf_block(MPI_Comm comm, MPIO_Offset dsize, int darg,
        MPI_Datatype oldtype, MPI_Datatype *newtype)

MPIO_TYPE_HPF_BLOCK(COMM, DSIZE, DARG, OLDTYPE, NEWTYPE, IERROR)
    INTEGER COMM, DSIZE, DARG, OLDTYPE, NEWTYPE, IERROR

MPIO_Type_hpf_block generates a filetype corresponding to the HPF BLOCK distribution
of a one-dimensional dsize element array of oldtype.

This call is a shorthand for:

    distrib = MPIO_HPF_BLOCK;
    MPIO_Type_hpf(comm, 1, &dsize, &distrib, &darg, oldtype, &newtype);


8.4.3 MPIO_Type_hpf_cyclic

HPF CYCLIC distribution of a one-dimensional array:

MPIO_TYPE_HPF_CYCLIC(comm, dsize, darg, oldtype, newtype)

IN    comm        [SAME] communicator to be used in MPIO_Open (handle)
IN    dsize       [SAME] size of distributee (nonnegative offset)
IN    darg        [SAME] distribution argument, e.g. CYCLIC(darg) (integer)
IN    oldtype     [SAME] old datatype (handle)
OUT   newtype     new datatype (handle)

int MPIO_Type_hpf_cyclic(MPI_Comm comm, MPIO_Offset dsize, int darg,
        MPI_Datatype oldtype, MPI_Datatype *newtype)

MPIO_TYPE_HPF_CYCLIC(COMM, DSIZE, DARG, OLDTYPE, NEWTYPE, IERROR)
    INTEGER COMM, DSIZE, DARG, OLDTYPE, NEWTYPE, IERROR

MPIO_Type_hpf_cyclic generates a filetype corresponding to the HPF CYCLIC distribution
of a one-dimensional dsize element array of oldtype.

This call is a shorthand for:

    distrib = MPIO_HPF_CYCLIC;
    MPIO_Type_hpf(comm, 1, &dsize, &distrib, &darg, oldtype, &newtype);

8.4.4 MPIO_Type_hpf_2d

HPF distribution of a two-dimensional array:

MPIO_TYPE_HPF_2D(comm, dsize1, distrib1, darg1, dsize2, distrib2, darg2, oldtype, newtype)

IN    comm        [SAME] communicator to be used in MPIO_Open (handle)
IN    dsize1      [SAME] size of distributee for first dim (nonnegative offset)
IN    distrib1    [SAME] HPF distribution for first dim (integer)
IN    darg1       [SAME] distribution argument for first dim (integer)
IN    dsize2      [SAME] size of distributee for second dim (nonnegative offset)
IN    distrib2    [SAME] HPF distribution for second dim (integer)
IN    darg2       [SAME] distribution argument for second dim (integer)
IN    oldtype     [SAME] old datatype (handle)
OUT   newtype     new datatype (handle)

int MPIO_Type_hpf_2d(MPI_Comm comm, MPIO_Offset dsize1, MPIO_Dtype distrib1,
        int darg1, int dsize2, MPIO_Dtype distrib2, int darg2,
        MPI_Datatype oldtype, MPI_Datatype *newtype)

MPIO_TYPE_HPF_2D(COMM, DSIZE1, DISTRIB1, DARG1, DSIZE2, DISTRIB2, DARG2,
        OLDTYPE, NEWTYPE, IERROR)
    INTEGER COMM, DSIZE1, DISTRIB1, DARG1, DSIZE2, DISTRIB2, DARG2
    INTEGER OLDTYPE, NEWTYPE, IERROR

MPIO_Type_hpf_2d generates a filetype corresponding to the HPF (distrib1(darg1),
distrib2(darg2)) distribution of a two-dimensional (dsize1,dsize2) element array of oldtype.
This call is a shorthand for:

    dsize[0] = dsize1;
    distrib[0] = distrib1;
    darg[0] = darg1;
    dsize[1] = dsize2;
    distrib[1] = distrib2;
    darg[1] = darg2;
    MPIO_Type_hpf(comm, 2, dsize, distrib, darg, oldtype, &newtype);

9 Error Handling 

The error handling mechanism of MPI-IO is based on that of MPI. Three new error classes,
called MPIO_ERR_UNRECOVERABLE, MPIO_ERR_RECOVERABLE and MPIO_ERR_EOF, are
introduced. They respectively contain all unrecoverable I/O errors, all recoverable I/O
errors, and the error associated with a read operation beyond the end of file. Each
implementation will provide the user with a list of supported error codes, and their
association with these error classes.

Each file handle has an error handler associated with it when it is created. Three new
predefined error handlers are defined. MPIO_UNRECOVERABLE_ERRORS_ARE_FATAL considers
all I/O errors of class MPIO_ERR_UNRECOVERABLE as fatal, and ignores all other I/O
errors. MPIO_ERRORS_RETURN ignores all I/O errors. And MPIO_ERRORS_ARE_FATAL
considers all I/O errors as fatal.
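
The behavior of the three predefined handlers can be tabulated with a small decision function. This is a plain C model that makes no MPI-IO calls; the enumerations and the function model_is_fatal are hypothetical names that mirror the classes and handlers described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirrors of the three MPI-IO error classes. */
typedef enum {
    MODEL_ERR_UNRECOVERABLE,
    MODEL_ERR_RECOVERABLE,
    MODEL_ERR_EOF
} model_err_class;

/* Hypothetical mirrors of the three predefined error handlers. */
typedef enum {
    MODEL_UNRECOVERABLE_ERRORS_ARE_FATAL,
    MODEL_ERRORS_RETURN,
    MODEL_ERRORS_ARE_FATAL
} model_handler;

/* Returns true if the handler treats an error of the given class as fatal. */
bool model_is_fatal(model_handler handler, model_err_class err_class)
{
    switch (handler) {
    case MODEL_UNRECOVERABLE_ERRORS_ARE_FATAL:
        return err_class == MODEL_ERR_UNRECOVERABLE;
    case MODEL_ERRORS_RETURN:
        return false; /* every I/O error is ignored */
    case MODEL_ERRORS_ARE_FATAL:
        return true;  /* every I/O error is fatal */
    }
    assert(0 && "unknown handler");
    return true;
}
```

For example, a read beyond end of file is fatal only under the third handler in this model.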

Advice to implementors. MPIO_UNRECOVERABLE_ERRORS_ARE_FATAL should be the
default error handler associated with each file handle at its creation. When a fatal
error (I/O related or not) occurs, open files should be closed (and optionally deleted if
they were opened with the MPIO_DELETE access mode), and all I/O buffers should
be flushed before all executing processes are aborted by the program. However, these
issues remain implementation dependent. (End of advice to implementors.)

New functions allow the user to create (function MPIO_Errhandler_create) new MPI-IO
error handlers, to associate (function MPIO_Errhandler_set) an error handler with an opened
file (through its file handle), and to inquire (function MPIO_Errhandler_get) which error
handler is currently associated with an opened file.

The attachment of error handlers to file handles is purely local: different processes may
attach different error handlers to the same file handle.

9.1 MPIO_Errhandler_create (independent)

MPIO_ERRHANDLER_CREATE(function, errhandler)

IN    function     User-defined error handling function
OUT   errhandler   MPI error handler (handle)

int MPIO_Errhandler_create(MPIO_Handler_function function,
        MPI_Errhandler *errhandler)

MPIO_ERRHANDLER_CREATE(FUNCTION, ERRHANDLER, IERROR)
    EXTERNAL FUNCTION
    INTEGER ERRHANDLER, IERROR

MPIO_Errhandler_create registers the user routine function for use as an MPI error
handler. It returns in errhandler a handle to the registered error handler.

The user routine should be a C function of type MPIO_Handler_function, which is defined
as:

    typedef void (MPIO_Handler_function)(MPIO_File *, int *, MPI_Datatype *,
            int *, MPI_Status *, int *, ...);

The first argument is the file handle in use. The second argument is the error code to
be returned by the MPI routine. The third argument is the buffer datatype associated with
the current access to the file (the current access to the file is either the current blocking
access to the file, or the request currently being tested with MPI_Test or waited for with
MPI_Wait, associated with a nonblocking access to the file). The fourth argument is the
number of such buffer datatype items requested by the current access to the file. The fifth
argument is the status returned by the current access to the file. The sixth argument is
the request number associated with the current access to the file (this number is relevant
for nonblocking accesses only). The number of additional arguments and their meanings are
implementation dependent. Addresses are used for all arguments so that the error handling
function can be written in FORTRAN.
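
A user-written handler matching this signature might look like the following sketch. Because no MPI-IO library is assumed to be available, MPIO_File, MPI_Datatype and MPI_Status are replaced here by stand-in typedefs so that the example compiles on its own; a real program would take them from the MPI and MPI-IO headers. The handler log_io_error and the two bookkeeping variables are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in types so the sketch compiles without an MPI-IO library.
 * These are assumptions; real definitions come from the library headers. */
typedef int MPIO_File;
typedef int MPI_Datatype;
typedef struct { int count; } MPI_Status;

typedef void (MPIO_Handler_function)(MPIO_File *, int *, MPI_Datatype *,
                                     int *, MPI_Status *, int *, ...);

/* Hypothetical bookkeeping used only by this example. */
static int last_error_code = 0;
static int errors_seen = 0;

/* Example handler: record the error code and report the failed access.
 * All arguments arrive by address, as the interface requires. */
void log_io_error(MPIO_File *fh, int *errcode, MPI_Datatype *buftype,
                  int *bufcount, MPI_Status *status, int *request, ...)
{
    assert(fh != NULL && errcode != NULL && bufcount != NULL);
    (void)buftype;  /* unused in this sketch */
    (void)status;   /* unused in this sketch */
    (void)request;  /* unused in this sketch */
    last_error_code = *errcode;
    errors_seen++;
    fprintf(stderr, "I/O error %d on file handle %d (%d items requested)\n",
            *errcode, *fh, *bufcount);
}
```

Such a function would be registered with MPIO_Errhandler_create and then attached to a file handle with MPIO_Errhandler_set.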

9.2 MPIO_Errhandler_set (independent) 

MPIO_ERRHANDLER_SET(fh, errhandler)

IN    fh           Valid file handle (handle)
IN    errhandler   New MPI error handler for opened file (handle)

int MPIO_Errhandler_set(MPIO_File fh, MPI_Errhandler errhandler)

MPIO_ERRHANDLER_SET(FH, ERRHANDLER, IERROR)
    INTEGER FH, ERRHANDLER, IERROR

MPIO_Errhandler_set associates the new error handler errhandler with the file handle fh
at the calling process. Note that an error handler is always associated with the file handle.

9.3 MPIO_Errhandler_get (independent)

MPIO_ERRHANDLER_GET(fh, errhandler)

IN    fh           Valid file handle (handle)
OUT   errhandler   MPI error handler currently associated with file handle (handle)

int MPIO_Errhandler_get(MPIO_File fh, MPI_Errhandler *errhandler)

MPIO_ERRHANDLER_GET(FH, ERRHANDLER, IERROR)
    INTEGER FH, ERRHANDLER, IERROR

MPIO_Errhandler_get returns in errhandler the error handler that is currently associated
with the file handle fh at the calling process.


interpret the hints in slightly different ways. For example, the following table outlines
possible interpretations for an MPI-IO implementation based on the Vesta parallel file
system:

hint                    interpretation
striping-unit           BSU size
striping-factor         number of cells
IO-node-list            base node
partitioning-pattern    Vesta partitioning parameters


B System Support for File Pointers 

B.1 Interface Style

The basic MPI-IO design calls for offsets to be passed explicitly in each read/write oper- 
ation. This avoids issues of uncertain semantics when multiple processes are performing 
I/O operations in parallel, especially mixed seek and read/write operations. It also reflects 
current practices, where programmers often keep track of offsets themselves, rather than 
using system-maintained offsets. 

There are a number of ways to add support for system-maintained file pointers to the 
interface: 

1. Add a whence argument to each read/write call, to specify whether the given offset is
to be used directly or whether it is relative to the current system-maintained offset.
To just use the system-maintained offset, the offset argument should be set to 0.

2. Define certain special values for the offset argument. For example, -1 could mean
that the system-maintained individual offset should be used, and -2 that the
system-maintained shared offset be used.

3. Define a separate set of functions with no offset argument.

We have chosen the third approach for the following reasons. First, it saves overhead because 
the system need not update offsets unless they are actually used. Second, it makes the 
interface look more like a conventional Unix interface for users who use system-maintained
offsets. This is preferable to an interface with extra arguments that are not used.

B.2 File Pointer Update 

In normal Unix I/O operations, the system-maintained file pointer is only updated when 
the operation completes. At that stage, it is known exactly how much data was actually 
accessed (which can be different from the amount requested), and the pointer is updated 
by that amount. 

When MPI-IO nonblocking accesses are made using an individual or the shared file 
pointer, the update cannot be delayed until the operation completes, because additional 
accesses can be initiated before that time by the same process (for both types of file pointers) 
or by other processes (for the shared file pointer). Therefore the file pointer must be updated 
at the outset, by the amount of data requested. 

Similarly, when blocking accesses are made using the shared file pointer, updating the 
file pointer at the completion of each access would have the same effect as serializing all 
blocking accesses to the file. In order to prevent this, the shared file pointer for blocking
accesses is updated at the beginning of each access by the amount of data requested. 

For blocking accesses using an individual file pointer, updating the file pointer at the 
completion of each access would be perfectly valid. However, in order to maintain the same 
semantics for all types of accesses using file pointers, the update of the file pointer in this 
case is also made at the beginning of the access by the amount of data requested. 

This way of updating file pointers may lead to some problems in rare circumstances,
as in the following scenario:

    MPIO_Read_Next(fh, buff, buftype, bufcount, &status);
    MPIO_Write_Next(fh, buff, buftype, bufcount, &status);

If the first read reaches EOF, since the file pointer is incremented by the amount of 
data requested, the write will occur beyond EOF, leaving a hole in the file. However, such a 
problem only occurs if reads and writes are mixed with no checking, which is an uncommon 
pattern. 
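
The scenario can be reproduced with a toy model. The following is only a sketch: `toy_file`, `toy_read_next` and `toy_write_next` are hypothetical names, not part of the MPI-IO interface, and the file is reduced to an end-of-file position and a pointer. The one rule taken from the text above is that the pointer advances by the amount requested at the start of each access.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a file with a system-maintained individual file pointer.
 * All names are hypothetical; this is not the MPI-IO API. */
typedef struct {
    size_t eof;      /* current end of file, in bytes */
    size_t pointer;  /* system-maintained file pointer, in bytes */
} toy_file;

/* Read `requested` bytes at the file pointer. The pointer is advanced by the
 * amount requested, even if EOF cuts the read short.
 * Returns the number of bytes that were actually available. */
size_t toy_read_next(toy_file *f, size_t requested)
{
    assert(f != NULL);
    size_t start = f->pointer;
    f->pointer += requested;            /* eager update by amount requested */
    if (start >= f->eof)
        return 0;
    size_t avail = f->eof - start;
    return avail < requested ? avail : requested;
}

/* Write `requested` bytes at the file pointer. Returns the size of the hole
 * (gap between the old EOF and the start of the write), or 0 if none. */
size_t toy_write_next(toy_file *f, size_t requested)
{
    assert(f != NULL);
    size_t start = f->pointer;
    f->pointer += requested;            /* same eager update as for reads */
    size_t hole = start > f->eof ? start - f->eof : 0;
    if (start + requested > f->eof)
        f->eof = start + requested;     /* the write extends the file */
    return hole;
}
```

With a 60-byte file, a 100-byte read returns 60 bytes but moves the pointer to 100, so a following 100-byte write starts at offset 100 and leaves a 40-byte hole. Checking the amount actually read before writing avoids the problem.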

B.3 Collective Operations with Shared File Pointers 

The current definition of the MPI-IO interface only includes independent read and write 
operations using shared file pointers. Collective calls are not included, because they seem 
to be unnecessary. The main use of a shared pointer is to partition data among processes 
on the fly, with no prior coordination. Collective operations imply coordinated access by 
all the processes. These two approaches seem at odds with each other. 

C Unix Read/Write Atomic Semantics 

The Unix file system read/write interfaces provide atomic access to files. For example,
suppose process A writes a 64K block starting at offset 0, and process B writes a 32K block
starting at offset 32K (see Figure 6). With no synchronization, the 32K overlapping block
of the resulting file (starting at offset 32K) will come either entirely from process A or
entirely from process B. The overlapping block will not contain a mix of data from both
processes.

[Figure 6: Unix Atomic Semantics. For overlapping writes by Process A and Process B, the
overlapping region holds the data of one process OR the other, NOT a mixture of both.]

Similarly, if process A writes a 64K block starting at offset 0, and process B reads a
64K block starting at offset 32K, process B will read the overlapping block either as old
data or as new data written by process A, but not as mixed data. When files are declustered
on multiple storage servers, similar read/write atomicity needs to be guaranteed. All data
of a single read that spans multiple parallel storage servers must be read entirely before
or entirely after all data of a write to the same data has proceeded. A simple but
inefficient solution to enforce these semantics is to serialize all overlapped I/O. Actually,
it is worse than that: all I/O operations would need to be synchronized and checked for
overlap before they could proceed. However, more efficient techniques are available to ensure
correct ordering of parallel point-sourced reads and writes without resorting to full-blown
synchronization and locking protocols. Some parallel file systems, like IBM Vesta [2],
provide support to implement such checking. If it is known that no overlapping I/O
operations will occur, or that the application is only reading the file, I/O can proceed in
a reckless mode (i.e. no checking). Reckless mode is the default mode when opening a file in
MPI-IO. This implies that users are responsible for writing correct programs (i.e.
non-overlapping I/O requests or read only). MPI-IO also supports a cautious mode that
enforces read/write atomic semantics. Be aware that this mode may lead to lower performance.
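
The overlap that atomic semantics must protect can be computed directly from the two requests. The helper below is a minimal sketch with a hypothetical name (`overlap_bytes`). It only illustrates the kind of check a cautious mode implies and is not taken from any MPI-IO or Vesta implementation.

```c
#include <assert.h>

/* Number of bytes shared by the half-open byte ranges
 * [off_a, off_a + len_a) and [off_b, off_b + len_b); 0 if they are disjoint. */
long overlap_bytes(long off_a, long len_a, long off_b, long len_b)
{
    assert(len_a >= 0 && len_b >= 0);
    long start = off_a > off_b ? off_a : off_b;   /* later of the two starts */
    long end_a = off_a + len_a;
    long end_b = off_b + len_b;
    long end   = end_a < end_b ? end_a : end_b;   /* earlier of the two ends */
    return end > start ? end - start : 0;
}
```

For the example in the text, a 64K write at offset 0 and a 32K write at offset 32K overlap in 32K bytes. Requests for which the function returns 0 can proceed without any ordering between them.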
D Filetype Constructors: Sample Implementations and Examples

D.1 Support Routines


    /*
     * MPIO_Type_set_bounds - surround a type with holes (increasing the extent)
     */
    int MPIO_Type_set_bounds(
        int displacement,          /* Displacement from the lower bound */
        int ub,                    /* Set the upper bound */
        MPI_Datatype oldtype,      /* Old datatype */
        MPI_Datatype *newtype)     /* New datatype */
    {
        int blocklength[3];
        MPI_Datatype type[3];
        MPI_Aint disp[3];

        blocklength[0] = 1;
        disp[0] = 0;
        type[0] = MPI_LB;

        blocklength[1] = 1;
        disp[1] = displacement;
        type[1] = oldtype;

        blocklength[2] = 1;
        disp[2] = ub;
        type[2] = MPI_UB;

        /*
         * newtype =
         *   { (LB, 0), (oldtype, displacement), (UB, ub) }
         */
        return MPI_Type_struct(3, blocklength, disp, type, newtype);
    }

D.2 Sample Filetype Constructor Implementations

D.2.1 MPIO_Type_scatter_gather Sample Implementation 


    /*
     * MPIO_Type_scatter_gather - generate scatter/gather datatype to access data
     *                            block in rank order. Blocks are identical in size
     *                            and datatype.
     */
    int MPIO_Type_scatter_gather(
        MPI_Comm comm,             /* Communicator group */
        MPI_Datatype oldtype,      /* Block datatype */
        MPI_Datatype *newtype)     /* New datatype */
    {
        int size, rank;
        int extent;

        MPI_Type_extent(oldtype, &extent);

        MPI_Comm_size(comm, &size);
        MPI_Comm_rank(comm, &rank);

        return MPIO_Type_set_bounds(rank*extent, size*extent, oldtype, newtype);
    }
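
To make the effect of the two bounds concrete, the sketch below computes where the blocks of one process fall when this filetype is tiled over the file. The function name `scatter_gather_offset` is hypothetical, and the arithmetic simply restates the call above: each tile spans `size*extent` bytes, and the block of process `rank` sits `rank*extent` bytes into it.

```c
#include <assert.h>

/* Byte offset, relative to the file displacement, of the k-th block accessed
 * by `rank` when the scatter/gather filetype is repeated over the file.
 * `extent` is the extent in bytes of one block (the old datatype). */
long scatter_gather_offset(int rank, int size, long extent, long k)
{
    assert(size > 0 && 0 <= rank && rank < size);
    assert(extent > 0 && k >= 0);
    long tile = (long)size * extent;         /* extent of the filetype (UB) */
    return k * tile + (long)rank * extent;   /* hole in front of the block  */
}
```

With 10 processes and blocks of 1000 doubles (8000 bytes, assuming 8-byte doubles), process 3 accesses bytes starting at 24000 in the first tile and at 184000 in the third.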

D.2.2 HPF BLOCK Sample Implementation


    /*
     * MPIO_Type_hpf_block - generate datatypes for a HPF BLOCK(darg) distribution
     */
    int MPIO_Type_hpf_block(
        MPI_Comm comm,             /* Communicator group */
        int dsize,                 /* Size of distributee */
        int darg,                  /* Distribution argument */
        MPI_Datatype oldtype,      /* Old datatype */
        MPI_Datatype *newtype)     /* New datatype */
    {
        int size, rank;
        int extent;
        int beforeblocksize;
        int myblocksize;
        int nblocks;
        int is_partial_block;
        MPI_Datatype block1;
        int rc;

        MPI_Comm_size(comm, &size);
        MPI_Comm_rank(comm, &rank);
        MPI_Type_extent(oldtype, &extent);

        /*
         * Compute and check distribution argument
         */
        if (darg == MPIO_HPF_DFLT_ARG)      /* [HPF, p. 27, L37] */
            darg = (dsize + size - 1) / size;
        if (darg * size < dsize)            /* [HPF, p. 27, L33] */
            return MPIO_ERROR_ARG;

        /*
         * Compute the sum of the sizes of the blocks of all processes
         * ranked before me, and the size of my block
         */
        nblocks = dsize / darg;
        is_partial_block = (dsize % darg != 0);
        if (nblocks < rank) {
            beforeblocksize = dsize;
            myblocksize = 0;
        } else if (nblocks == rank) {
            beforeblocksize = nblocks * darg;
            myblocksize = dsize % darg;
        } else {
            beforeblocksize = rank * darg;
            myblocksize = darg;
        }

        /*
         * Create filetype block with holes on either side
         */
        if ((rc = MPI_Type_contiguous(myblocksize, oldtype, &block1))
            == MPI_SUCCESS) {
            rc = MPIO_Type_set_bounds(beforeblocksize*extent, dsize*extent,
                                      block1, newtype);
            MPI_Type_free(&block1);
        }
        return rc;
    }
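
The block-size arithmetic can be checked without MPI. The sketch below repeats the three cases of the sample implementation in a plain C function. `block_share` and `hpf_block_share` are hypothetical names, and passing `darg <= 0` to request the default block size is an assumption standing in for `MPIO_HPF_DFLT_ARG`.

```c
#include <assert.h>

typedef struct {
    int before;  /* elements owned by all lower-ranked processes */
    int mine;    /* elements owned by this process */
} block_share;

/* Share of a dsize-element array owned by `rank` under BLOCK(darg) over
 * `size` processes. darg <= 0 selects the default, ceil(dsize / size);
 * that convention is an assumption made for this sketch. */
block_share hpf_block_share(int dsize, int size, int rank, int darg)
{
    block_share s;
    assert(dsize > 0 && size > 0 && 0 <= rank && rank < size);
    if (darg <= 0)
        darg = (dsize + size - 1) / size;
    assert(darg * size >= dsize);     /* blocks must cover the whole array */

    int nblocks = dsize / darg;       /* number of full blocks */
    if (nblocks < rank) {             /* past the end: nothing left */
        s.before = dsize;
        s.mine = 0;
    } else if (nblocks == rank) {     /* owner of the partial block, if any */
        s.before = nblocks * darg;
        s.mine = dsize % darg;
    } else {                          /* owner of a full block */
        s.before = rank * darg;
        s.mine = darg;
    }
    return s;
}
```

For 10 elements on 4 processes the default block size is 3, so ranks 0 to 2 own 3 elements each and rank 3 owns the single remaining element, which starts after 9 elements.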
D.2.3 HPF CYCLIC Sample Implementation

    /*
     * MPIO_Type_hpf_cyclic - generate types for HPF CYCLIC(darg) distribution;
     *                        we assume here that dsize >= darg * size; in other
     *                        words, we do not support degenerated cases where
     *                        some processes may not have any data assigned to them
     */
    int MPIO_Type_hpf_cyclic(
        MPI_Comm comm,             /* Communicator group */
        int dsize,                 /* Distributee size */
        int darg,                  /* Distribution argument */
        MPI_Datatype oldtype,      /* Old datatype */
        MPI_Datatype *newtype)     /* New datatype */
    {
        int size, rank;
        int extent;
        int nelem, residue;
        MPI_Datatype block1, block2, block3;
        int rc;

        MPI_Comm_size(comm, &size);
        MPI_Comm_rank(comm, &rank);
        MPI_Type_extent(oldtype, &extent);

        /*
         * Compute and check distribution argument
         */
        if (darg == MPIO_HPF_DFLT_ARG)      /* [HPF, p. 27, L42] */
            darg = 1;

        /*
         * Take care of full blocks (contains darg*size oldtype items)
         */
        nelem = dsize / (darg * size);
        if ((rc = MPI_Type_contiguous(darg, oldtype, &block1)) != MPI_SUCCESS)
            return rc;
        if ((rc = MPIO_Type_set_bounds(darg*rank*extent, darg*size*extent,
                                       block1, &block2)) != MPI_SUCCESS) {
            MPI_Type_free(&block1);
            return rc;
        }
        rc = MPI_Type_contiguous(nelem, block2, &block3);
        MPI_Type_free(&block1);
        MPI_Type_free(&block2);
        if (rc != MPI_SUCCESS)
            return rc;

        /*
         * Take care of residual block
         */
        residue = dsize - nelem * (darg * size);
        if (residue > rank * darg) {
            int last_block;
            int b[2];
            MPI_Aint d[2];
            MPI_Datatype t[2];
            MPI_Datatype block4, block5;

            last_block = residue - rank * darg;
            if (last_block > darg)
                last_block = darg;
            if ((rc = MPI_Type_contiguous(last_block, oldtype, &block4))
                != MPI_SUCCESS) {
                MPI_Type_free(&block3);
                return rc;
            }
            if ((rc = MPIO_Type_set_bounds(darg*rank*extent, residue*extent,
                                           block4, &block5)) != MPI_SUCCESS) {
                MPI_Type_free(&block3);
                MPI_Type_free(&block4);
                return rc;
            }
            b[0] = 1;
            b[1] = 1;
            d[0] = 0;
            d[1] = nelem * darg * size * extent;
            t[0] = block3;
            t[1] = block5;
            rc = MPI_Type_struct(2, b, d, t, newtype);
            MPI_Type_free(&block4);
            MPI_Type_free(&block5);
        } else {
            rc = MPIO_Type_set_bounds(0, dsize*extent, block3, newtype);
        }
        MPI_Type_free(&block3);
        return rc;
    }
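
The split into full cycles and a residual block can be exercised on its own. The sketch below uses hypothetical helper names (`hpf_cyclic_count`, `hpf_cyclic_owner`) and plain integers instead of datatypes. It follows the same `nelem` and `residue` arithmetic as the sample implementation and also returns a correct count for a process that receives nothing from the residue.

```c
#include <assert.h>

/* Number of elements of a dsize-element array owned by `rank` under
 * CYCLIC(darg) over `size` processes. */
int hpf_cyclic_count(int dsize, int size, int rank, int darg)
{
    assert(dsize >= 0 && size > 0 && darg > 0);
    assert(0 <= rank && rank < size);
    int nelem   = dsize / (darg * size);           /* number of full cycles   */
    int residue = dsize - nelem * (darg * size);   /* elements in last cycle  */
    int last    = residue - rank * darg;           /* this rank's part of it  */
    if (last < 0)
        last = 0;
    if (last > darg)
        last = darg;
    return nelem * darg + last;
}

/* Rank that owns global element i under the same distribution. */
int hpf_cyclic_owner(int i, int size, int darg)
{
    assert(i >= 0 && size > 0 && darg > 0);
    return (i / darg) % size;
}
```

For 10 elements under CYCLIC(2) on 3 processes there is one full cycle of 6 elements and a residue of 4, so ranks 0 and 1 own 4 elements each and rank 2 owns 2.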


D.3 Example: Row block distribution of A[100, 100] 

Consider an application (such as one generating visualization data) which saves a timestep 
of a 2-dimensional array A[100][100] in standard C-order to a file. Say we have 10 nodes.
The array A is distributed among the nodes in a simple row block decomposition. 

The array is distributed to nodes as (each number represents a 10x10 block):

    0000000000
    1111111111
    2222222222
    3333333333
    4444444444
    5555555555
    6666666666
    7777777777
    8888888888
    9999999999

in other words:

    Node 0:
        A[0, 0], A[0, 1], A[0, 2], ..., A[0, 99],
        A[1, 0], A[1, 1], A[1, 2], ..., A[1, 99],
        A[2, 0], A[2, 1], A[2, 2], ..., A[2, 99],
        ...
        A[9, 0], A[9, 1], A[9, 2], ..., A[9, 99]

    Node 1:
        A[10, 0], A[10, 1], A[10, 2], ..., A[10, 99],
        A[11, 0], A[11, 1], A[11, 2], ..., A[11, 99],
        A[12, 0], A[12, 1], A[12, 2], ..., A[12, 99],
        ...
        A[19, 0], A[19, 1], A[19, 2], ..., A[19, 99]

    ...

    Node 9:
        A[90, 0], A[90, 1], A[90, 2], ..., A[90, 99],
        A[91, 0], A[91, 1], A[91, 2], ..., A[91, 99],
        A[92, 0], A[92, 1], A[92, 2], ..., A[92, 99],
        ...
        A[99, 0], A[99, 1], A[99, 2], ..., A[99, 99]

D.3.1 Intel CFS Implementation

The CFS code might look like:

    double myA[10][100];
    int fd;

    fd = open(filename, O_WRONLY, 0644);
    setiomode(fd, M_RECORD);

    /* Compute new value of myA */

    write(fd, &myA[0][0], sizeof(myA));

D.3.2 MPI-IO Implementation

The equivalent MPI-IO code would be:

    double myA[10][100];
    MPIO_Offset disp = MPIO_OFFSET_ZERO;
    MPIO_Offset offset;
    MPI_Datatype myA_t, myA_ftype;
    MPIO_File fh;
    MPI_Status status;
    char filename[255];

    MPI_Type_contiguous(1000, MPI_DOUBLE, &myA_t);
    MPIO_Type_scatter_gather(MPI_COMM_WORLD, myA_t, &myA_ftype);
    MPI_Type_commit(&myA_t);
    MPI_Type_commit(&myA_ftype);
    MPIO_Open(MPI_COMM_WORLD, filename, MPIO_WRONLY,
              disp, MPI_DOUBLE, myA_ftype, MPIO_OFFSET_RELATIVE, 0, &fh);

    /* Compute new value of myA */

    offset = disp;
    MPIO_Write_all(fh, offset, &myA[0][0], myA_t, 1, &status);
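
As a cross-check of the row block layout, the sketch below maps a local element to its position in the file. The function and constant names are hypothetical; the formula follows from the decomposition above, in which node r holds global rows 10r to 10r+9 and the file is in C-order.

```c
#include <assert.h>

enum { DIM = 100, NODES = 10, ROWS_PER_NODE = DIM / NODES };

/* Index, counted in doubles from the start of the file, of the local element
 * myA[i][j] held by node `rank` in the row block decomposition. */
long row_block_file_index(int rank, int i, int j)
{
    assert(0 <= rank && rank < NODES);
    assert(0 <= i && i < ROWS_PER_NODE);
    assert(0 <= j && j < DIM);
    long global_row = (long)rank * ROWS_PER_NODE + i;
    return global_row * DIM + j;
}
```

Every node owns one contiguous run of 1000 doubles, which is why a single contiguous block per node is enough to describe the file layout here.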

D.4 Example: Column block distribution of A[100, 100]

Again, consider an application which saves a timestep of a 2-dimensional array A[100][100]
in standard C-order to a file, run on 10 nodes. For this example, the array A is distributed
among the nodes in a simple column block decomposition.

The array is distributed to nodes as (each number represents a 10x10 block):

    0123456789
    0123456789
    0123456789
    0123456789
    0123456789
    0123456789
    0123456789
    0123456789
    0123456789
    0123456789


D.4.1 Intel CFS Implementation

The CFS code might look like:

    double myA[100][10];
    int fd;
    int i;

    fd = open(filename, O_WRONLY, 0644);
    setiomode(fd, M_RECORD);

    /* Compute new value of myA */

    for (i = 0; i < 100; i++)
        write(fd, &myA[i][0], sizeof(myA)/100);

D.4.2 MPI-IO Implementation 

The equivalent MPI-IO code would be: 

    double myA[100][10];
    MPIO_Offset disp = MPIO_OFFSET_ZERO;
    MPIO_Offset offset;
    MPI_Datatype subrow_t, row_t, myA_ftype;
    MPIO_File fh;
    MPI_Status status;
    char filename[255];

    MPI_Type_contiguous(10, MPI_DOUBLE, &subrow_t);
    MPIO_Type_scatter_gather(MPI_COMM_WORLD, subrow_t, &row_t);
    MPI_Type_contiguous(100, row_t, &myA_ftype);
    MPI_Type_commit(&myA_ftype);
    MPI_Type_free(&subrow_t);
    MPI_Type_free(&row_t);
    MPIO_Open(MPI_COMM_WORLD, filename, MPIO_WRONLY,
              disp, MPI_DOUBLE, myA_ftype, MPIO_OFFSET_RELATIVE, 0, &fh);

    /* Compute new value of A */

    offset = disp;
    MPIO_Write_all(fh, offset, &myA[0][0], MPI_DOUBLE, 1000, &status);
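
The corresponding index mapping for the column block layout is sketched below, again with hypothetical names. Node r holds global columns 10r to 10r+9, so consecutive local rows are 100 doubles apart in the file.

```c
#include <assert.h>

enum { DIM = 100, NODES = 10, COLS_PER_NODE = DIM / NODES };

/* Index, counted in doubles from the start of the file, of the local element
 * myA[i][j] held by node `rank` in the column block decomposition. */
long column_block_file_index(int rank, int i, int j)
{
    assert(0 <= rank && rank < NODES);
    assert(0 <= i && i < DIM);
    assert(0 <= j && j < COLS_PER_NODE);
    long global_col = (long)rank * COLS_PER_NODE + j;
    return (long)i * DIM + global_col;
}
```

Each node's data is split into 100 runs of 10 doubles. The CFS version therefore issues 100 writes per node, while the filetype lets the MPI-IO version describe the same layout in a single call.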

D.5 Example: Transposing a 2-D Matrix in a Row-Cyclic Distribution 

The following code implements the example depicted in Figure 3 in Section 2. A 2-D matrix 
is to be transposed in a row-cyclic distribution onto m processes. For the purpose of this 
example, we assume that matrix A is a square matrix of size n and that each element of the 
matrix is a double precision real number (etype is MPI_DOUBLE).


    int           m;            /* number of tasks in MPI_COMM_WORLD */
    int           rank;         /* rank of the task within MPI_COMM_WORLD */
    void         *Aloc;         /* local matrix assigned to the task */
    int           n;            /* size (in etype) of global matrix A */
    int           nrow;         /* number of rows assigned to the task */
    int           sizeofAloc;   /* size (in bytes) of local matrix Aloc */
    char          mat_A[10] = "file.A";     /* name of the file containing matrix A */
                                            /* the file is assumed to exist */
    MPIO_Offset   disp = MPIO_OFFSET_ZERO;  /* file.A is supposed to have no header */
    MPIO_Mode     amode;
    MPI_Datatype  etype;
    MPI_Datatype  filetype;
    int           moffset;
    MPIO_Hints   *hints;
    MPIO_File     fh;
    MPIO_Offset   offset;
    MPI_Datatype  buftype;
    int           bufcount;
    MPI_Status    status;

    /* temporary variables */
    int           sizeofetype;
    MPI_Datatype  column_t;

    MPI_Comm_size(MPI_COMM_WORLD, &m);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Determine number of rows assigned to the task */
    nrow = n / m;
    if (rank < n % m) nrow++;

    amode = MPIO_RDONLY;

    /* Aloc is a matrix of MPI_DOUBLE items */
    etype = MPI_DOUBLE;
    MPI_Type_extent(etype, &sizeofetype);

    MPIO_Type_hpf_cyclic(MPI_COMM_WORLD, n * n, n, etype, &filetype);
    MPI_Type_commit(&filetype);

    moffset = MPIO_OFFSET_RELATIVE;   /* relative offsets will be used */
    hints = NULL;                     /* hints are not fully implemented yet */

    /* Open file containing matrix A */
    MPIO_Open(MPI_COMM_WORLD, mat_A, amode, disp, etype,
              filetype, moffset, hints, &fh);

    /* Define buffer type that transposes each row of the matrix read in and */
    /* concatenates the resulting columns */
    MPI_Type_vector(n, 1, nrow, etype, &column_t);
    MPI_Type_hvector(nrow, 1, sizeofetype, column_t, &buftype);
    MPI_Type_commit(&buftype);
    MPI_Type_free(&column_t);

    /* Allocate memory for local matrix Aloc */
    MPI_Type_extent(buftype, &sizeofAloc);
    Aloc = (void *) malloc(sizeofAloc);

    /* Read in local matrix Aloc */
    offset = disp;
    bufcount = 1;
    MPIO_Read(fh, offset, Aloc, buftype, bufcount, &status);
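
The placement performed by the buffer type can be imitated with ordinary loops. The sketch below is an interpretation of the two constructors, not code from the interface. It assumes that the vector step spaces the n elements of one row `nrow` elements apart, and that the hvector step shifts each successive row by one element. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Buffer index (in elements) where the element at local row r, column c of
 * the data read from the file is stored: columns are nrow elements apart,
 * successive rows are one element apart. */
size_t transposed_index(size_t r, size_t c, size_t nrow)
{
    assert(r < nrow);
    return c * nrow + r;
}

/* Scatter nrow rows of n elements, arriving in row-major order in `stream`,
 * into `buf` so that `buf` holds the transpose (n rows of nrow elements). */
void scatter_transposed(const double *stream, double *buf,
                        size_t nrow, size_t n)
{
    for (size_t r = 0; r < nrow; r++)
        for (size_t c = 0; c < n; c++)
            buf[transposed_index(r, c, nrow)] = stream[r * n + c];
}
```

For two rows {1, 2, 3} and {4, 5, 6}, the buffer ends up as {1, 4, 2, 5, 3, 6}, that is, three rows of two elements. In the MPI-IO version this rearrangement happens during the read itself, with no intermediate copy in user code.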

E Justifying Design Decisions 

This section contains a haphazard collection of arguments for other designs and against the 
one we chose, with explanations of why they were rejected. 


Argument: Filetype should be defined in the read/write operation, not in the open call. 
This is similar to having the sendtype and recvtype in MPI scatter/gather calls.

Answer: This is more cumbersome, especially since it is expected that filetypes will not be 
changed often (if at all). Also, the filetype may be much larger than the buftype (or much 
smaller), which makes it harder to understand how they are aligned. The MPI case does 
not have this problem because the sizes must match. 


Argument: Absolute offsets are confusing, no good, and nobody uses them. 
Answer: OK, we’ll have relative offsets too. 


Argument: Relative offsets are confusing, no good, and nobody uses them. 
Answer: OK, we’ll have absolute offsets too. 


Argument: MPI-like functions with informative names should be used, e.g. Read_Broadcast,
Write_Single, Read_Scatter, Write_Gather.

Answer: This causes confusion if the filetype is used as well, because the same effect can 
be achieved in very different ways. The reason to prefer the filetype approach over the 
specific-functions approach is that it is more flexible and provides a mechanism to express 
additional new access patterns. 




